Hi, I am trying to figure out why some of my HTML pages didn't get crawled, and it seems to me that there may be some issue in Nutch.
I believe that the following parameters are the most important in my case when fetching pages with Nutch 0.7:

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to fetch a page. Each time it finds that a host is busy, it will wait fetcher.server.delay. After http.max.delays attempts, it will give up on the page for now.</description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

If I understand this correctly, Nutch should try to fetch a page at least three times, with no less than 5 seconds between individual attempts. However, when I look into the crawl log file, I can see that one particular page did not get into the index, and for it there are only two error messages:

error #1:

050913 113818 fetching http://xxxx_some_page_xxxx.html
org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
        at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
        at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)

error #2:

050913 113959 fetch of http://xxxx_some_page_xxxx.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later

I would expect that after these two exceptions there should be one more record: either another error, or a note that the page was fetched successfully. But no other message can be found in the log file, and the page is NOT in the index after fetching is finished. Can anyone explain to me what I am getting wrong?

Regards,
Lukas
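
P.S. To make my understanding concrete, here is a rough sketch in plain Java of how I imagine blockAddr() applies http.max.delays and fetcher.server.delay. This is not the actual Nutch source; the HostState class and the counting logic are just my assumptions about the behaviour described above.

// Sketch of my assumed http.max.delays / fetcher.server.delay interaction.
public class DelaySketch {

    static final int HTTP_MAX_DELAYS = 3;       // http.max.delays
    static final long SERVER_DELAY_MS = 5000;   // fetcher.server.delay (5.0 s)

    /** Simulates the per-host politeness check I assume blockAddr() performs. */
    static void blockAddr(HostState host) throws Exception {
        int delays = 0;
        while (host.isBusy()) {
            if (delays >= HTTP_MAX_DELAYS) {
                // After http.max.delays waits the page is given up on "for now",
                // which would produce the RetryLater message I see in the log.
                throw new Exception("Exceeded http.max.delays: retry later");
            }
            Thread.sleep(SERVER_DELAY_MS);       // wait fetcher.server.delay
            delays++;
        }
        host.markInUse();
    }

    /** Trivial stand-in for the per-host bookkeeping; always busy here. */
    static class HostState {
        boolean isBusy() { return true; }
        void markInUse() { }
    }

    public static void main(String[] args) {
        try {
            blockAddr(new HostState());
        } catch (Exception e) {
            System.out.println(e.getMessage()); // "Exceeded http.max.delays: retry later"
        }
    }
}

With the values above, this would give up after roughly three 5-second waits, and my expectation was that the URL would then simply be retried later rather than dropped from the index entirely.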
