Hi Doug,

I see what you mean and it makes a lot of sense now.

However, this leads me to the question of what exactly the
fetcher.threads.per.host value is used for. More specifically, what does
*host* mean in the Nutch configuration world?

Does it mean that if the fetcher.threads.per.host value is set to 1,
then concurrent crawling of two documents from the same domain name is
forbidden (e.g. http://www.abc.com/1.html and http://www.abc.com/2.html),
even though these two documents might in fact be physically located on
two different servers without our knowledge?

On the other hand, one physical server can be assigned multiple domain
names, so crawling http://www.abc.com/1.html and http://www.xyz.com/1.html
concurrently can mean that the same server is handling both requests.
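Just to make this ambiguity concrete, here is a tiny Java sketch (the
host names are only the illustrative ones from above, not real targets):
one DNS name can resolve to several IP addresses (several physical
servers), and two different DNS names can resolve to the same IP
address (one server).

import java.net.InetAddress;
import java.util.Arrays;

// Toy check of the name-vs-server ambiguity: getAllByName() may return
// several addresses for one name, and two names may share an address.
public class HostVsServer {
    public static void main(String[] args) throws Exception {
        // Illustrative host names from the examples above.
        for (String host : new String[] {"www.abc.com", "www.xyz.com"}) {
            InetAddress[] addrs = InetAddress.getAllByName(host);
            System.out.println(host + " -> " + Arrays.toString(addrs));
        }
    }
}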
 
When setting the fetcher.threads.per.host value, what should I keep in
mind: the DNS host name (i.e. just $1 from http://([a-zA-Z0-9_.-]+)/.*)
or the IP address (as returned by nslookup)?
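For what it's worth, rather than a hand-written regex, the host part can
be taken from java.net.URL, and the nslookup view from
java.net.InetAddress. A minimal sketch of the two candidate keys (just
the standard library, nothing Nutch-specific):

import java.net.InetAddress;
import java.net.URL;

// Prints the two possible grouping keys for one URL: the DNS host name
// (URL.getHost()) and the resolved IP address (an nslookup equivalent).
public class HostKey {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.abc.com/1.html"); // example URL from above
        String host = url.getHost();                    // "www.abc.com"
        String ip = InetAddress.getByName(host).getHostAddress();
        System.out.println("host name: " + host + ", IP: " + ip);
    }
}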

I can also see that this subject has already been discussed in the
NUTCH-69 ticket, but no resolution was reached.

I don't want to make this message longer, but imagine the following situation:
I start crawling with three URLs in mind, like:
url1 (contains 2 pages),
url2 (50 pages),
url3 (2000 pages)
Now, when crawling is started with 3 threads, then after url1 is
crawled one thread becomes redundant and the error rate starts
growing. After url2 is crawled (potentially not fully, due to thread
collisions), all three threads are left for the one huge url3 only. This
means that I can't get url3 crawled fully, because we are not able to
avoid thread collisions, despite the fact that three threads were
needed at the beginning.
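To convince myself of this, I hacked together the toy simulation below.
It is not Nutch code: one Semaphore permit per host stands in for
fetcher.threads.per.host=1, MAX_DELAYS and SERVER_DELAY_MS stand in for
http.max.delays and fetcher.server.delay, and the page counts are scaled
down. Once only url3 pages are left, two of the three threads keep
timing out and dropping its pages, printing a burst of RetryLater lines.

import java.util.Map;
import java.util.concurrent.*;

// Toy model of the scenario above (not Nutch code). Three hosts, three
// fetcher threads, one permit per host. A thread that cannot get the
// permit within MAX_DELAYS waits drops the page, mimicking the
// "Exceeded http.max.delays: retry later" error.
public class FetcherToy {
    static final int MAX_DELAYS = 3;         // analogue of http.max.delays
    static final long SERVER_DELAY_MS = 10;  // analogue of fetcher.server.delay
    static final long FETCH_MS = 50;         // simulated time per page fetch

    public static void main(String[] args) throws InterruptedException {
        // Scaled-down versions of url1 (2 pages), url2 (50), url3 (2000).
        Map<String, Integer> pages = Map.of("url1", 2, "url2", 5, "url3", 40);
        Map<String, Semaphore> permits = new ConcurrentHashMap<>();
        BlockingQueue<String> fetchList = new LinkedBlockingQueue<>();
        pages.forEach((host, n) -> { for (int i = 0; i < n; i++) fetchList.add(host); });

        Runnable fetcher = () -> {
            String host;
            while ((host = fetchList.poll()) != null) {
                Semaphore permit = permits.computeIfAbsent(host, h -> new Semaphore(1));
                boolean ok = false;
                try {
                    // Wait at most MAX_DELAYS times for SERVER_DELAY_MS each,
                    // like a fetcher thread waiting on a busy host.
                    for (int d = 0; d < MAX_DELAYS && !ok; d++)
                        ok = permit.tryAcquire(SERVER_DELAY_MS, TimeUnit.MILLISECONDS);
                    if (!ok) { System.out.println("RetryLater: " + host); continue; }
                    Thread.sleep(FETCH_MS); // the "fetch" outlasts the waiters
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                } finally {
                    if (ok) permit.release();
                }
            }
        };

        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) pool.execute(fetcher);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}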

Anyway, thanks for the answer!
Lukas

On 9/14/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Lukas Vlcek wrote:
> > 050913 113818 fetching http://xxxx_some_page_xxxx.html
> > org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
> 
> This page will be retried the next time the fetcher is run.
> 
> This message means that this thread has waited http.max.delays times for
> fetcher.server.delay seconds, and each time it found that another thread
> was already accessing a page at this site.  To avoid these, increase
> http.max.delays to a larger number, or, if you're crawling only servers
> that you control, set fetcher.threads.per.host to something greater than
> one (making the fetcher faster, but impolite).
> 
> Doug
>
