[ 
http://issues.apache.org/jira/browse/NUTCH-205?page=comments#action_12365892 ] 

M.Oliver Scheele commented on NUTCH-205:
----------------------------------------

Here's an easy-to-reproduce example of the issue above:

1.)  Install Nutch 0.7.1

2.)  To simulate a slow network / timeout, set the following parameters in your 
"crawl-tool.xml":
<property>
  <name>http.max.delays</name>
  <!-- number of times a fetcher thread will wait for a busy host
       before giving up; 1 makes it give up after a single delay -->
  <value>1</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <!-- at most one thread per host, so threads queue up behind each
       other and run into the delay limit above -->
  <value>1</value>
</property>
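
For context on why these values trigger the error quickly: http.max.delays 
bounds how many times a fetcher thread will wait for a busy host before 
giving up, and fetcher.threads.per.host=1 makes threads queue up per host. 
A toy sketch of that give-up logic (illustration only, not the actual Nutch 
source; the 1-second delay is a stand-in for fetcher.server.delay):

public class DelayLimitSketch {
    public static void main(String[] args) throws InterruptedException {
        final int maxDelays = 1;          // http.max.delays
        final long serverDelayMs = 1000;  // stand-in for fetcher.server.delay
        boolean hostBusy = true;          // pretend the host stays busy

        int delays = 0;
        while (hostBusy) {
            if (delays >= maxDelays) {    // the single allowed delay is used up
                System.out.println("Exceeded http.max.delays: retry later.");
                return;                   // the real fetcher throws RetryLater here
            }
            delays++;
            Thread.sleep(serverDelayMs);  // wait once, then re-check the host
        }
    }
}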

3.)  Prepare your crawl as described in the tutorial and let it run:
> nutch crawl urls.txt -dir crawl_data  -depth 4

4.)  While the crawl is running, you will notice some errors like this:
> fetch of http://www.webpage.de/mypage.html failed with: java.lang.Exception: 
> org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
[Note: That's OK; as the message says, the page should be recrawled next time.]

5.)  After crawling has finished, check the web-db with the following command:
> nutch readdb crawl_data/db/ -dumppageurl
You will see something like this:
> Page 5: Version: 4
> URL: http://www.webpage.de/mypage.html
> ID: 21c1e01ff46ec28c0af6da63512e43bf
> Next fetch: Sun Aug 17 07:12:55 CET 292278994
> Retries since fetch: 0
> Retry interval: 1 days

--> As you can see, the page would not be refetched before the year 292278994. 
The "retry later" never happens, and the page will never appear in a Nutch 
search result. :(
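
For reference, that date is simply Long.MAX_VALUE interpreted as milliseconds 
since the epoch, which appears to be what ends up stored in the web-db. A 
minimal check:

import java.util.Date;

public class MaxFetchDate {
    public static void main(String[] args) {
        // Long.MAX_VALUE ms after 1970-01-01 falls in the year 292278994,
        // matching the "Next fetch" value dumped above.
        System.out.println(new Date(Long.MAX_VALUE));
        // prints e.g. "Sun Aug 17 07:12:55 CET 292278994" (zone dependent)
    }
}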


> Wrong 'fetch date' for non available pages
> ------------------------------------------
>
>          Key: NUTCH-205
>          URL: http://issues.apache.org/jira/browse/NUTCH-205
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7, 0.7.1
>  Environment: JDK 1.4.2_09 / Windows 2000 / Using standard Nutch-API
>     Reporter: M.Oliver Scheele

>
> Web pages that couldn't be fetched because of a time-out are never 
> refetched.
> The next fetch date in the web-db is set to Long.MAX_VALUE.
> Example:
> -------------
> While fetching our URLs, we got some errors like this:
> 60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html  
> failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: 
> Exceeded http.max.delays: retry later.
> That seems to be ok and indicates some network problems.
> The problem is that the entry in the Webdb shows the following:
> Page 4: Version: 4
> URL: http://www.test-domain.de/crawl_html/page_2.html
> ID: b360ec931855b0420776909bd96557c0
> Next fetch: Sun Aug 17 07:12:55 CET 292278994
> Retries since fetch: 0
> Retry interval: 0 days
> The 'Next fetch' date is set to the year '292278994'.
> I probably won't live to see that refetch. ;)
> A page that couldn't be crawled because of network problems
> should be refetched with the next crawl (== set next fetch date to current 
> time + 1h).
> Possible Bug-Fixing:
> ----------------------------
> When updating the web-db, the method updateForSegment() in the 
> UpdateDatabaseTool class always sets the fetch date to Long.MAX_VALUE for 
> any (unknown) exception during fetching.
> The RETRY status is not always set correctly.
> Change the following lines:
> } else if (fo.getProtocolStatus().getCode() == ProtocolStatus.RETRY &&
>            page.getRetriesSinceFetch() < MAX_RETRIES) {
>   pageRetry(fo);                      // retry later
> } else {
>   pageGone(fo);                       // give up: page is gone
> }
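
For illustration of what the rescheduling in pageRetry() would then have to 
do: bump the retry counter and set the next fetch to the near future (now + 
1h, as suggested above) instead of Long.MAX_VALUE. A self-contained sketch 
using a stand-in for the web-db entry (field names are illustrative, not the 
real Nutch 0.7 API):

public class PageRetrySketch {
    static final long ONE_HOUR_MS = 60L * 60L * 1000L;

    // stand-in for a web-db Page entry; not the real Nutch Page class
    static class PageEntry {
        long nextFetchTime;
        int retriesSinceFetch;
    }

    static void pageRetry(PageEntry page) {
        page.retriesSinceFetch++;
        // reschedule one hour ahead instead of Long.MAX_VALUE
        page.nextFetchTime = System.currentTimeMillis() + ONE_HOUR_MS;
    }
}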

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
