URLs blocked db.fetch.retry.max * http.max.delays times during fetching are
marked as STATUS_DB_GONE
------------------------------------------------------------------------------------------------------

                 Key: NUTCH-350
                 URL: http://issues.apache.org/jira/browse/NUTCH-350
             Project: Nutch
          Issue Type: Bug
            Reporter: Stefan Groschupf
            Priority: Critical


Intranet crawls or focused crawls fetch many pages from the same host. As a
result, a thread is often blocked because another thread is already fetching
from that host, and it is very likely that a thread is blocked more than
http.max.delays times. In that case the HttpBase.blockAddr method throws an
HttpException, which the fetcher handles by incrementing the CrawlDatum retry
counter and setting the status to STATUS_FETCH_RETRY. That means a URL gets at
most db.fetch.retry.max * http.max.delays chances to be fetched before it is
marked STATUS_DB_GONE, and in intranet or focused crawls this is very likely
not enough. Increasing either of the involved properties dramatically slows
down the fetch.
I suggest not incrementing the CrawlDatum retriesSinceFetch when the problem
was caused by a blocked thread (see the sketch below).
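
A minimal sketch of the suggested behavior, in plain Java with stand-in types.
The Datum class and the BlockedException subtype below are hypothetical
illustrations, not actual Nutch APIs: the idea is that the http plugin would
signal "blocked by the per-host politeness limit" distinctly from a genuine
fetch failure, and the fetcher would only count the latter against
db.fetch.retry.max.

public class BlockedRetrySketch {

  /** Stand-in for org.apache.nutch.crawl.CrawlDatum (hypothetical). */
  static class Datum {
    int retriesSinceFetch;
    String status = "STATUS_FETCH_SUCCESS";
  }

  /** Generic fetch failure, as thrown by the http plugin. */
  static class HttpException extends Exception {}

  /** Hypothetical subtype thrown only when http.max.delays is exhausted. */
  static class BlockedException extends HttpException {}

  /**
   * Proposed handling: a BlockedException still marks the URL for retry,
   * but does NOT increment retriesSinceFetch, so politeness blocking alone
   * can never push a URL past db.fetch.retry.max into STATUS_DB_GONE.
   */
  static void handleFailure(Datum datum, HttpException e) {
    if (!(e instanceof BlockedException)) {
      datum.retriesSinceFetch++; // genuine failure: count it as before
    }
    datum.status = "STATUS_FETCH_RETRY";
  }

  public static void main(String[] args) {
    Datum datum = new Datum();
    handleFailure(datum, new BlockedException());
    System.out.println(datum.retriesSinceFetch); // 0 -- blocked, not counted
    handleFailure(datum, new HttpException());
    System.out.println(datum.retriesSinceFetch); // 1 -- real failure counted
  }
}

With this change, a URL on a busy host would simply be rescheduled until a
fetch slot opens, while genuinely failing URLs would still exhaust their
db.fetch.retry.max retries and be marked STATUS_DB_GONE as today.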
