[ http://issues.apache.org/jira/browse/NUTCH-205?page=comments#action_12365434 ]
Andrzej Bialecki commented on NUTCH-205:
-----------------------------------------

This is a design choice, not a bug. The errors you see are due to improper
configuration - some threads cannot access the host for a long time because
of the limit on concurrent requests to a single host. Please see the
"fetcher.threads.per.host" and "http.max.delays" config properties.

> Wrong 'fetch date' for non-available pages
> -------------------------------------------
>
>           Key: NUTCH-205
>           URL: http://issues.apache.org/jira/browse/NUTCH-205
>       Project: Nutch
>          Type: Bug
>    Components: fetcher
>      Versions: 0.7, 0.7.1
>   Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
>      Reporter: M.Oliver Scheele
>
> Web pages that could not be fetched because of a timeout are never
> refetched: their next fetch date in the web-db is set to Long.MAX_VALUE.
>
> Example:
> -------------
> While fetching our URLs, we got some errors like this:
>
>   60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html
>   failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater:
>   Exceeded http.max.delays: retry later.
>
> That seems to be ok and indicates some network problems.
> The problem is that the entry in the web-db shows the following:
>
>   Page 4: Version: 4
>   URL: http://www.test-domain.de/crawl_html/page_2.html
>   ID: b360ec931855b0420776909bd96557c0
>   Next fetch: Sun Aug 17 07:12:55 CET 292278994
>   Retries since fetch: 0
>   Retry interval: 0 days
>
> The 'Next fetch' date is set to the year 292278994.
> I probably will not live to see that refetch. ;)
>
> A page that could not be crawled because of network problems should be
> refetched with the next crawl (i.e. set its next fetch date to the current
> time + 1h).
>
> Possible bug fix:
> ----------------------------
> When updating the web-db, the method updateForSegment() in
> UpdateDatabaseTool.class always sets the fetch date to Long.MAX_VALUE for
> any (unknown) exception during fetching; the RETRY status is not always
> set correctly. Change the following lines:
>
>   } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
>              page.getRetriesSinceFetch() < MAX_RETRIES) {
>     pageRetry(fo);   // retry later
>   } else {
>     pageGone(fo);    // give up: page is gone
>   }

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
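
For reference, the two properties Andrzej points to live in Nutch's
configuration files (conf/nutch-default.xml, with local overrides in
conf/nutch-site.xml). Below is a minimal sketch of such an override,
assuming the 0.7-style <nutch-conf> layout; the values are purely
illustrative, not recommended defaults. Raising fetcher.threads.per.host
allows more concurrent requests to a single host, and raising
http.max.delays lets a blocked fetcher thread wait longer before it gives
up with the "Exceeded http.max.delays: retry later." error quoted above.

  <?xml version="1.0"?>
  <nutch-conf>
    <property>
      <name>fetcher.threads.per.host</name>
      <!-- illustrative value: max concurrent requests allowed to one host -->
      <value>2</value>
    </property>
    <property>
      <name>http.max.delays</name>
      <!-- illustrative value: how many times a thread waits for a busy host
           before failing the page with RetryLater -->
      <value>10</value>
    </property>
  </nutch-conf>

Note that allowing more threads per host trades politeness toward that host
for fewer RetryLater failures, so it is usually only appropriate when you
control the crawled server.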
