Thank you for replying.

Description of the NUTCH-629 patch: it purges hosts if the download speed is too low (based on a speed limit, a minimum number of pages fetched, and the amount of pages remaining) or if there are too many errors (based on a percentage and on the amount of pages fetched, successfully or not).
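For illustration, here is a rough sketch of the speed-based purge described above (the class, field and threshold names below are made up, they are not taken from the NUTCH-629 patch):

    // Hypothetical sketch of the speed-based purge described above; names and
    // thresholds are illustrative only, not taken from the NUTCH-629 patch.
    public class SlowHostCheck {

        static final double MIN_BYTES_PER_SEC   = 1024;  // speed limit
        static final int    MIN_PAGES_FETCHED   = 50;    // only judge after this many pages
        static final int    MIN_PAGES_REMAINING = 20;    // only purge if enough work is left

        /** Returns true if the host's remaining URLs should be dropped as "tooSlow". */
        static boolean tooSlow(long bytesFetched, long elapsedMillis,
                               int pagesFetched, int pagesRemaining) {
            if (pagesFetched < MIN_PAGES_FETCHED || pagesRemaining < MIN_PAGES_REMAINING) {
                return false;
            }
            double bytesPerSec = bytesFetched / (elapsedMillis / 1000.0);
            return bytesPerSec < MIN_BYTES_PER_SEC;
        }

        public static void main(String[] args) {
            // a host that delivered 200 KB in 10 minutes over 60 pages, with 500 pages left
            System.out.println(SlowHostCheck.tooSlow(200 * 1024, 10 * 60 * 1000, 60, 500)); // true
        }
    }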
I think that patch 769 is less precise about the errors that occur. For example, with fetcher.max.exceptions.per.queue = 35: if 40 of the 400 pages of a given host are dead (404), the host would be purged even though only 10% of its pages are dead. So we would increase fetcher.max.exceptions.per.queue. However, in the case of an unknown host, we would lose a lot of time...

I think it would be better either to change fetcher.max.exceptions.per.queue into a percentage, or to keep it absolute and require that the allowed amount of errors be reached in a rush.

Your patch 770 is quite good.

Thanks
Louis

2010/7/29 Julien Nioche (JIRA) <[email protected]>

>
>     [
> https://issues.apache.org/jira/browse/NUTCH-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893684#action_12893684]
>
> Julien Nioche commented on NUTCH-629:
> -------------------------------------
>
> The 2 features below have been added to 1.1 and provide something comparable:
>
> https://issues.apache.org/jira/browse/NUTCH-769 : Fetcher to skip queues
> for URLS getting repeated exceptions
> https://issues.apache.org/jira/browse/NUTCH-770 : Timebomb for Fetcher
>
> > Detect slow and timeout servers and drop their URLs
> > ---------------------------------------------------
> >
> >                 Key: NUTCH-629
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-629
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: fetcher
> >            Reporter: Otis Gospodnetic
> >            Assignee: Otis Gospodnetic
> >         Attachments: NUTCH-629.patch
> >
> > Fetch jobs will finish faster if we find a way to prevent servers that
> > are either slow or time out from slowing down the whole process.
> > I'll attach a patch that counts per-server exceptions and timeouts and
> > tracks download speed per server.
> > Queues/servers that exceed timeout or download thresholds are marked as
> > "tooManyErrors" or "tooSlow". Once they get marked as such, all of their
> > subsequent URLs get dropped (i.e. they do not get fetched) and marked GONE.
> > At the end of the fetch task, stats for each server processed are printed.
> > Also, I believe the per-host/domain/TLD/etc. DB from NUTCH-628 would be
> > the right place to add server data collected by this patch.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
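PS: to illustrate the percentage-based threshold I suggest above, here is a rough sketch (the class and parameter names are made up, this is not code from NUTCH-769 or from any patch):

    // Hypothetical sketch: purge a host queue only when the error *ratio*
    // exceeds a limit, instead of an absolute exception count.
    public class QueueErrorPolicy {

        private final float maxErrorRatio;   // e.g. 0.25f = 25% of fetched pages
        private final int   minPagesFetched; // don't judge a host on too few pages

        public QueueErrorPolicy(float maxErrorRatio, int minPagesFetched) {
            this.maxErrorRatio = maxErrorRatio;
            this.minPagesFetched = minPagesFetched;
        }

        /** Returns true if the queue for this host should be purged. */
        public boolean shouldPurge(int pagesFetched, int exceptions) {
            if (pagesFetched < minPagesFetched) {
                return false; // not enough evidence yet
            }
            return ((float) exceptions / pagesFetched) > maxErrorRatio;
        }

        public static void main(String[] args) {
            QueueErrorPolicy policy = new QueueErrorPolicy(0.25f, 100);
            // my example above: 40 dead pages out of 400 fetched = 10% -> keep the host
            System.out.println(policy.shouldPurge(400, 40));  // false
            // an unknown host failing on nearly every request -> purge it quickly
            System.out.println(policy.shouldPurge(100, 95));  // true
        }
    }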

