[ https://issues.apache.org/jira/browse/NUTCH-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Asitang Mishra updated NUTCH-2091:
----------------------------------
Priority: Major (was: Minor)
> Increase robustness and crawling versatility of Nutch for the Deep Web
> ----------------------------------------------------------------------
>
> Key: NUTCH-2091
> URL: https://issues.apache.org/jira/browse/NUTCH-2091
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.10
> Reporter: Asitang Mishra
> Labels: memex, nutch
>
> In certain cases Nutch fails to fetch a page, or crawls less productively
> than it could. This issue is for discussing those specific cases and
> generalizing fixes for them into Nutch, to make it more robust and productive.
> I ran into many problems while crawling three websites and have boiled them
> down to the following points.
> 1. Some websites (e.g. marketwired) detect that the crawler is not a browser,
> for example through cookie validation, and keep sending it back to the first
> page.
> 2. Some data sits behind a click on an 'a' tag that is not really a link
> (e.g. javascript void). Nutch needs to detect which such clicks to perform.
> This would be an improvement for the selenium plugin.
> 3. After clicking something that changes the page, how do we get back to the
> previous page before clicking further? We obviously can't look for a back
> button or a close (cross) button. One option: save the old state, compare it
> with the new page, and keep only the extra information.
> 4. Differentiate between navigation links and ordinary links on a forum page,
> so that the two kinds can be used differently to decide the progress of the
> crawler (nav links decide the rounds; other links are followed for one round).
> 5. Add the ability to change # to ? in URLs (pataxia.com). Right now URL
> normalization completely removes the part after #, treating it as a simple
> anchor.
> 6. Easy route decisions in a property file to control how the fetcher
> behaves: instead of going all BFS or DFS, there should be a way to make it do
> a DEPTH-LIMITED search. This is especially good for forums and the like. Users
> could also supply known inputs such as depth to direct the crawler if they
> know something specific about the site.
> 7. A forum can be roughly generalized as: a meta topic page (no nav links)
> -> post list (with nav links) -> post page (with nav links). How can we make
> Nutch aware of this structure/hierarchy, possibly with simple clues supplied
> manually? This can be seen as an extension of the previous point.
> 8. Sometimes even nav links are not actual links but ajax requests.
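One way to read point 5 is that the fragment should be carried over as a query string rather than discarded. A minimal sketch of that rewrite follows, in Python for brevity rather than as a Nutch normalizer plugin. The function name `fragment_to_query` and the example.com URLs are hypothetical, and whether a given site actually serves the `?` form is an assumption that would need checking per site.

```python
from urllib.parse import urlsplit, urlunsplit


def fragment_to_query(url: str) -> str:
    """Move the part after '#' into the query string instead of dropping it.

    Hypothetical helper for illustration; not part of Nutch.
    """
    parts = urlsplit(url)
    if not parts.fragment:
        # Nothing after '#', so leave the URL untouched.
        return url
    # Append the fragment to an existing query, or use it as the whole query.
    if parts.query:
        query = parts.query + "&" + parts.fragment
    else:
        query = parts.fragment
    # Rebuild the URL with an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(fragment_to_query("http://example.com/forum#page=2"))
```

Such a rewrite would probably have to be opt-in per host, since on most sites the part after # really is just an in-page anchor.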
> NOTE: "Nav links" here means the structure on a web page (such as a forum)
> that lets us go to other pages by number or via next, previous, first,
> and/or last links.
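Going by the definition in the NOTE, a first rough heuristic for point 4 could classify a link by its anchor text alone. This is only a sketch under that assumption: the names `NAV_WORDS`, `is_nav_link` and `split_links` are made up for illustration, and real forums will likely need more signals (point 8 notes that some nav links are not links at all).

```python
# Words taken from the NOTE's definition of nav links (assumed to be English).
NAV_WORDS = {"next", "previous", "first", "last"}


def is_nav_link(anchor_text: str) -> bool:
    """Guess whether a link is a nav link from its visible text only."""
    text = anchor_text.strip().lower()
    # Page numbers ("2", "17") or one of the navigation words.
    return text.isdigit() or text in NAV_WORDS


def split_links(links):
    """Partition (anchor_text, url) pairs into (nav, other) lists."""
    nav, other = [], []
    for text, url in links:
        (nav if is_nav_link(text) else other).append((text, url))
    return nav, other


if __name__ == "__main__":
    sample = [("2", "/topic?page=2"), (" Next ", "/topic?page=2"), ("Reply", "/reply")]
    print(split_links(sample))
```

The two lists could then be scheduled differently, for example letting nav links drive further rounds while other links are followed for a single round.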
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)