Hi Doug,

I see what you mean and it makes sense now.
However, this leads me to the question: what exactly is the fetcher.threads.per.host value used for? More specifically, what does *host* mean in the Nutch configuration world?

Does it mean that if fetcher.threads.per.host is set to 1, then concurrent crawling of two documents from the same domain name is forbidden (e.g. http://www.abc.com/1.html and http://www.abc.com/2.html), even though these two documents might in fact be physically located on two different servers without our knowledge? On the other hand, one physical server can be assigned multiple domain names, so crawling http://www.abc.com/1.html and http://www.xyz.com/1.html concurrently could still mean that the same server handles both requests. When setting the fetcher.threads.per.host value, what should I keep in mind: the DNS host name (meaning just $1 from http://([a-zA-Z0-9._-]+)/.*) or the IP address (nslookup)? (I try to make this distinction concrete in the Java sketch at the bottom of this message.) I can also see that this subject has already been discussed in the NUTCH-69 ticket, but no solution was reached.

I don't want to make this message longer, but imagine the following situation. I start crawling with three URLs in mind:

  url1 (2 pages)
  url2 (50 pages)
  url3 (2000 pages)

Now, when crawling starts with 3 threads, then after url1 is crawled one thread becomes redundant and the error rate starts growing. After url2 is crawled (potentially not fully, due to thread collisions) there are three threads left for just the one huge url3. This means I cannot get url3 crawled fully, because we are not able to avoid thread collisions, in spite of the fact that all three threads were needed at the beginning.

Anyway, thanks for your answer!
Lukas

On 9/14/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Lukas Vlcek wrote:
> > 050913 113818 fetching http://xxxx_some_page_xxxx.html
> > org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
>
> This page will be retried the next time the fetcher is run.
>
> This message means that this thread has waited http.max.delays times for
> fetcher.server.delay seconds, and each time it found that another thread
> was already accessing a page at this site. To avoid these, increase
> http.max.delays to a larger number, or, if you're crawling only servers
> that you control, set fetcher.threads.per.host to something greater than
> one (making the fetcher faster, but impolite).
>
> Doug
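
P.S. To make the name-vs-address distinction above concrete, here is a minimal Java sketch (plain java.net, nothing Nutch-specific; the two URLs are just my examples from above and the class name is made up):

    import java.net.InetAddress;
    import java.net.URL;

    public class HostVsIp {
        public static void main(String[] args) throws Exception {
            String[] urls = { "http://www.abc.com/1.html", "http://www.xyz.com/1.html" };
            for (int i = 0; i < urls.length; i++) {
                // Name-based grouping: the $1 my regex above would capture.
                String host = new URL(urls[i]).getHost();
                // Address-based grouping: what nslookup reports for that name.
                // (Throws UnknownHostException if the name does not resolve.)
                String ip = InetAddress.getByName(host).getHostAddress();
                System.out.println(urls[i] + " -> host=" + host + ", ip=" + ip);
            }
        }
    }

If the two names resolve to the same address, name-based grouping happily runs two threads against one physical server; conversely, with round-robin DNS a single host name maps onto several servers. That is exactly why I am unsure which of the two fetcher.threads.per.host is meant to limit.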
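
Putting rough numbers on your explanation (these values are hypothetical, not the shipped defaults): with fetcher.server.delay = 5.0 seconds and http.max.delays = 3, a colliding thread waits at most 3 × 5 = 15 seconds for a busy host to become free before it logs the RetryLater error above. In my url3 scenario, with 2000 pages behind a single host and fetcher.threads.per.host = 1, two of the three threads would spend most of the run in exactly that loop.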
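
And if I go with your advice, I assume the knobs would be overridden in conf/nutch-site.xml along these lines (the property names are the ones from this thread; the values are only an illustration, and I copied the root element from my local nutch-default.xml):

    <?xml version="1.0"?>
    <nutch-conf>
      <!-- Be more patient before giving up on a busy host. -->
      <property>
        <name>http.max.delays</name>
        <value>100</value>
      </property>
      <!-- Only for servers I control: faster, but impolite. -->
      <property>
        <name>fetcher.threads.per.host</name>
        <value>2</value>
      </property>
    </nutch-conf>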
