Hi,
I am trying to figure out why some of my HTML pages didn't get crawled,
and it seems to me that there may be an issue in Nutch.

I believe that the following parameters are the most important in my
case when fetching pages with Nutch 0.7:

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

If I understand it correctly, Nutch should try to fetch a page at
least three times, with no less than 5 seconds between
individual attempts.
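
To make my reading of these settings concrete, here is a minimal sketch of
the delay logic as the http.max.delays description words it. This is only a
model of the description, not the actual code in Http.blockAddr; the class
and method names are hypothetical, and the array of "host busy" observations
is an assumption made purely for illustration.

```java
public class MaxDelaysSketch {

    /**
     * Models one fetcher thread waiting for a busy host.
     * Returns how many delays passed before the host was free,
     * or -1 if the thread gave up on the page for now.
     */
    static int delaysBeforeFetch(boolean[] hostBusy, int maxDelays) {
        int delays = 0;
        for (boolean busy : hostBusy) {
            if (!busy) {
                return delays;   // host is free: the page can be fetched now
            }
            if (delays == maxDelays) {
                return -1;       // modelled "Exceeded http.max.delays: retry later"
            }
            delays++;            // a real thread would sleep fetcher.server.delay here
        }
        return -1;               // observations ran out while the host was still busy
    }

    /** Longest time one thread waits before giving up, in seconds. */
    static double maxWaitSeconds(int maxDelays, double serverDelaySeconds) {
        return maxDelays * serverDelaySeconds;
    }

    public static void main(String[] args) {
        boolean[] alwaysBusy = {true, true, true, true};
        System.out.println(delaysBeforeFetch(alwaysBusy, 3));  // gives up
        System.out.println(maxWaitSeconds(3, 5.0));            // seconds waited
    }
}
```

With the values above (http.max.delays = 3, fetcher.server.delay = 5.0), this
model has a thread wait up to 15 seconds on a busy host before giving up.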

However, when I look into the crawl log file I can see that one particular
page didn't get into the index, and there are only two error messages for it:

error #1:

050913 113818 fetching http://xxxx_some_page_xxxx.html
org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
        at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
        at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)

error #2:

050913 113959 fetch of http://xxxx_some_page_xxxx.html failed with:
java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later

I would expect that after these two exceptions there would be one
more record: either another error or a message that the page has
been successfully fetched. But no other message can be found in the log
file, and the page is NOT in the index after fetching has finished.

Can anyone explain what I am misunderstanding?

Regards,
Lukas
