Lukas Vlcek wrote:
However, this leads me to the question of what exactly the
fetcher.threads.per.host value is used for. More specifically, what
does *host* mean in the Nutch configuration world?

In this case, a host is an IP address.

Does it mean that if fetcher.threads.per.host value is set to 1 then
concurrent crawling of two documents from the same domain name is
forbidden (e.g.: http://www.abc.com/1.html and
http://www.abc.com/2.html) while in fact these two documents might be
physically located on two different servers without our knowledge?

Since the IP address is used, if a site uses round-robin DNS, then we could get two different IP addresses for the same host name and fetch from them simultaneously. However, in practice the DNS lookup will probably be cached somewhere (by the JVM or by our DNS server), so we'll almost always get the same address for a given host.
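To make the "host means IP address" point concrete, here is a minimal sketch of deriving a per-host key from a URL. It is not Nutch's actual code: the class HostKey and the methods keyFor and resolveWithDns are hypothetical names, and the resolver is passed in as a parameter so the idea can be tried without a network.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.function.Function;

// Hypothetical helper, for illustration only (not part of Nutch).
class HostKey {

    // Resolve a host name through the JVM, which may return a cached answer.
    static String resolveWithDns(String hostName) {
        try {
            return InetAddress.getByName(hostName).getHostAddress();
        } catch (UnknownHostException e) {
            // Assumption: fall back to the name itself when the lookup fails.
            return hostName;
        }
    }

    // The key used to limit concurrent fetches is the resolved address,
    // not the host name that appears in the URL.
    static String keyFor(String url, Function<String, String> resolver) {
        String hostName = URI.create(url).getHost();
        return resolver.apply(hostName);
    }
}
```

Calling HostKey.keyFor(url, HostKey::resolveWithDns) uses a real lookup. How long the JVM caches a successful lookup is controlled by the networkaddress.cache.ttl security property, which is one reason round-robin DNS rarely yields two addresses in practice.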

On the other hand, one physical server can be assigned multiple domain
names, so crawling http://www.abc.com/1.html and
http://www.xyz.com/1.html concurrently could mean that the same server
is serving both requests.

In this case, since both names resolve to the same IP address, only a single thread will be permitted to access this server at a time (with fetcher.threads.per.host set to 1).
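As a rough sketch of that behavior, assuming the limit is tracked per resolved IP address: the class PerAddressLimiter below is hypothetical and only illustrates the idea, it is not how the Nutch fetcher is implemented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical illustration of a per-address thread limit (not Nutch code).
class PerAddressLimiter {
    private final int threadsPerHost;
    private final Map<String, Semaphore> slots = new ConcurrentHashMap<>();

    PerAddressLimiter(int threadsPerHost) {
        this.threadsPerHost = threadsPerHost;
    }

    // Returns true if the calling thread may fetch from this address now.
    boolean tryAcquire(String ipAddress) {
        return slots
                .computeIfAbsent(ipAddress, k -> new Semaphore(threadsPerHost))
                .tryAcquire();
    }

    // Called when a fetch from this address has finished.
    void release(String ipAddress) {
        Semaphore slot = slots.get(ipAddress);
        if (slot != null) {
            slot.release();
        }
    }
}
```

With a limit of 1, www.abc.com and www.xyz.com share a single slot if they resolve to the same address, so the second fetch has to wait until the first one releases it.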

When setting the fetcher.threads.per.host value, what should I have in
mind: the DNS domain name (meaning just $1 from
http://(a-zA-Z\-_0-9).*/.*) or the IP address (nslookup)?

IP address.

I don't want to make this message longer but imagine the following situation:
I start crawling with three urls in mind like:
url1 (contains 2 pages),
url2 (50 pages),
url3 (2000 pages)
Now, when crawling is started with 3 threads, then after url1 is
crawled one thread becomes redundant and the error rate starts
growing. After url2 is crawled (potentially not fully, due to thread
collisions) there are three threads left for one huge url3 only. This
means that I can't get url3 crawled fully, because we are not
able to avoid thread collisions, in spite of the fact that three threads
were needed at the beginning.

If you set http.max.delays to a large value then you will get no errors. All the threads will be used initially, then, as hosts are exhausted, threads will block each other.
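A sketch of why a large http.max.delays avoids the errors, assuming (as I understand the property) that a thread that finds a host busy waits and retries up to that many times before giving up on the page. The class DelayedFetch, the method acquireWithDelays and its parameters are hypothetical; serverDelayMs stands in for the configured wait between attempts.

```java
import java.util.concurrent.Semaphore;

// Hypothetical illustration of wait-and-retry on a busy host (not Nutch code).
class DelayedFetch {

    // Try to get the host's slot; while it is busy, wait and retry up to
    // maxDelays times. Returns false when the thread gives up, which is
    // what would show up as a fetch error.
    static boolean acquireWithDelays(Semaphore hostSlot, int maxDelays, long serverDelayMs) {
        for (int attempt = 0; attempt <= maxDelays; attempt++) {
            if (hostSlot.tryAcquire()) {
                return true;
            }
            if (attempt < maxDelays) {
                try {
                    Thread.sleep(serverDelayMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With a small maxDelays, threads waiting on the one remaining large host give up and report errors. With a large value they simply keep waiting their turn, so the crawl finishes without errors but at the speed of a single thread per host.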

Doug
