[ 
https://issues.apache.org/jira/browse/NUTCH-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asitang Mishra updated NUTCH-2091:
----------------------------------
    Priority: Major  (was: Minor)

> Increase robustness and crawling versatility of Nutch for the Deep Web
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2091
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2091
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex, nutch
>
> Nutch fails to fetch a page, or crawls less productively than it could, in
> certain cases. This issue is to discuss those specific cases and try to
> generalize fixes for them into Nutch to make it more robust and productive.
> I came across three websites and hit many issues. I have distilled those
> issues into the following points.
> 1. Some websites detect that the crawler is not a browser (e.g. marketwired,
> via cookie validation) and send it back to the first page again and again.
> 2. Some data sits behind a click on an 'a' tag that is not exactly a link
> (we need to detect which clicks, e.g. javascript void hrefs). This would be
> an improvement for the selenium plugin.
> 3. When something on a page is clicked and the page changes, how do we get
> back to the previous page before clicking further? (We obviously can't look
> for a back button or a cross button. We could save the old state, juxtapose
> it with the new content, and keep only the extra info.)
> 4. Differentiate between a navigation link and an ordinary link in a forum
> page so that the two kinds of links can be used differently to decide the
> progress of the crawler (nav links decide the rounds; for other links we go
> one round).
> 5. Add the capability of changing # to ? (pataxia.com). Right now URL
> normalization completely removes the part after #, treating it as a simple
> anchor.
> 6. An easy route-decision setting in the property file to decide how the
> fetcher will behave (instead of going all BFS or DFS, there should be a way
> to make it do a DEPTH-LIMITED search, which is especially good for forums
> and the like. Users could also give known inputs such as depth to direct
> the crawler if they know something specific about the site).
> 7. A forum can be roughly generalized into: a meta topic page (no nav links)
> -> post list (with nav links) -> post page (with nav links). How do we make
> Nutch aware of this structure/hierarchy, possibly with simple clues given
> manually? This can be seen as an extension of the last point.
> 8. Sometimes even nav links are not actual links but ajax requests.
> NOTE: Nav links (as defined here): the structure on a web page (like a forum)
> that lets us go to various pages by number or via next, previous, first,
> and/or last links.
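
To make point 5 concrete, here is a minimal sketch of moving the part after # into the query string instead of dropping it. It is written in Python with only the standard library and is not Nutch code; the function name `fragment_to_query` and the example.com URLs are hypothetical, and whether a given site actually serves the same content under ? as under # is an assumption that would need checking per site.

```python
from urllib.parse import urlsplit, urlunsplit


def fragment_to_query(url: str) -> str:
    """Move a URL's '#fragment' into its query string instead of dropping it.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    parts = urlsplit(url)
    if not parts.fragment:
        # Nothing after '#', so leave the URL untouched.
        return url
    # Append to an existing query with '&', otherwise start a new query.
    if parts.query:
        query = f"{parts.query}&{parts.fragment}"
    else:
        query = parts.fragment
    # Rebuild the URL with an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(fragment_to_query("http://example.com/list#page=2"))
    print(fragment_to_query("http://example.com/list?a=1#page=2"))
```

A rewrite like this would have to run before the step that strips fragments, and probably only for hosts that are known to use # this way, since for most sites the fragment really is just an in-page anchor.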

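The depth-limited behaviour asked for in point 6 can be sketched as a breadth-first traversal that stops expanding links once a user-supplied depth is reached. This is an illustration in Python, not how the Nutch fetcher is implemented; `depth_limited_crawl`, `get_links`, `toy_links` and the toy `SITE` graph are all hypothetical names and data standing in for real fetching and parsing.

```python
from collections import deque


def depth_limited_crawl(seed, get_links, max_depth):
    """Return (url, depth) pairs in visit order, never deeper than max_depth.

    get_links is any callable that returns the outlinks of a URL.
    """
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            # Depth limit reached: visit this page but do not follow its links.
            continue
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for a forum (hypothetical data).
SITE = {
    "topics": ["list-1"],
    "list-1": ["post-1", "post-2", "list-2"],
    "list-2": ["post-3"],
    "post-1": ["post-1-page-2"],
}


def toy_links(url):
    """Look up outlinks in the toy graph; unknown pages have none."""
    return SITE.get(url, [])


if __name__ == "__main__":
    for url, depth in depth_limited_crawl("topics", toy_links, max_depth=2):
        print(depth, url)
```

With `max_depth=2` the toy crawl reaches the topic page, the first post list, and the pages linked from that list, but never fetches `post-3` or `post-1-page-2`. In a real setup the depth would come from a configuration property rather than a function argument.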


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
