Wrong 'fetch date' for non available pages
------------------------------------------

         Key: NUTCH-205
         URL: http://issues.apache.org/jira/browse/NUTCH-205
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7, 0.7.1    
 Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
    Reporter: M.Oliver Scheele


Web pages that could not be fetched because of a time-out are never refetched.
Their next fetch time in the WebDB is set to Long.MAX_VALUE.

Example:
-------------
While fetching our URLs, we got some errors like this:
60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed
with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later.

That seems to be ok and indicates some network problems.

The problem is that the entry in the WebDB then shows the following:

Page 4: Version: 4
URL: http://www.test-domain.de/crawl_html/page_2.html
ID: b360ec931855b0420776909bd96557c0
Next fetch: Sun Aug 17 07:12:55 CET 292278994
Retries since fetch: 0
Retry interval: 0 days

The 'Next fetch' date is set to the year 292278994.
I probably won't live to see that refetch. ;)
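
The year 292278994 is what Java reports for an instant of Long.MAX_VALUE
milliseconds after the epoch, which suggests the next fetch time was set to
that constant. A minimal sketch to check this (the class and method names are
made up for illustration and are not part of Nutch):

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Illustration only: not Nutch code.
class MaxFetchDate {

    // Year of the instant Long.MAX_VALUE milliseconds after the epoch, in UTC.
    static int yearOfMaxLong() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTime(new Date(Long.MAX_VALUE));
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        System.out.println(yearOfMaxLong()); // prints 292278994
    }
}
```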

A page that could not be crawled because of network problems should be
refetched with the next crawl (i.e. set the next fetch date to the current
time + 1h).
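
The "current time + 1h" rule can be sketched as follows. This is only an
illustration of the arithmetic; the class, constant and method names are
hypothetical and do not exist in Nutch.

```java
// Illustration only: hypothetical names, not Nutch code.
class RetrySchedule {

    // One hour in milliseconds, the delay proposed in this report.
    static final long ONE_HOUR_MS = 60L * 60L * 1000L;

    // Next fetch time for a page whose fetch failed for a transient reason.
    static long nextFetchAfterTransientFailure(long nowMs) {
        return nowMs + ONE_HOUR_MS;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(new java.util.Date(nextFetchAfterTransientFailure(now)));
    }
}
```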


Possible fix:
-------------

When updating the WebDB, the method updateForSegment() in the
UpdateDatabaseTool class always sets the fetch date to Long.MAX_VALUE for any
(unknown) exception during fetching.
The RETRY status is not always set correctly.

Change the following lines:

} else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
           page.getRetriesSinceFetch() < MAX_RETRIES) {
  pageRetry(fo);                      // retry later
} else {
  pageGone(fo);                       // give up: page is gone
}
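
To make the effect of this branch concrete, here is a small stand-alone model
of the decision. It is a sketch under assumptions, not the actual
UpdateDatabaseTool code: the status codes, the MAX_RETRIES value of 3 and the
"lenient" variant are all hypothetical. It only illustrates one possible
direction, namely treating a generic fetch exception as a transient failure
instead of letting it fall through to the "gone" case.

```java
// Illustration only: a simplified model, not Nutch code.
class RetryDecisionSketch {

    // Hypothetical status codes; NOT the real ProtocolStatus constants.
    static final int RETRY = 1;
    static final int EXCEPTION = 2;

    // Assumed limit; the real MAX_RETRIES value is not given in this report.
    static final int MAX_RETRIES = 3;

    // Mirrors the quoted branch: only an explicit RETRY status is retried.
    static String decide(int code, int retriesSinceFetch) {
        if (code == RETRY && retriesSinceFetch < MAX_RETRIES) {
            return "retry";
        }
        return "gone";
    }

    // Hypothetical variant: a generic exception also counts as transient.
    static String decideLenient(int code, int retriesSinceFetch) {
        boolean transientFailure = (code == RETRY || code == EXCEPTION);
        if (transientFailure && retriesSinceFetch < MAX_RETRIES) {
            return "retry";
        }
        return "gone";
    }

    public static void main(String[] args) {
        System.out.println(decide(EXCEPTION, 0));        // prints gone
        System.out.println(decideLenient(EXCEPTION, 0)); // prints retry
    }
}
```

In this model a time-out reported as a plain exception is marked "gone" by
the quoted logic, while the variant still gives up once the retry limit is
reached.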
