Lukas Vlcek wrote:
However, this leads me to the question of what exactly the fetcher.threads.per.host value is used for. More specifically, what does *host* mean in the Nutch configuration world?
In this case, a host is an IP address.
Does it mean that if the fetcher.threads.per.host value is set to 1, then concurrent crawling of two documents from the same domain name is forbidden (e.g.: http://www.abc.com/1.html and http://www.abc.com/2.html), even though these two documents might in fact be physically located on two different servers without our knowledge?
Since the IP address is used, if a site uses round-robin DNS, then we could get two different IP addresses for the same host name and fetch from them simultaneously. However, in practice the DNS lookup will probably be cached somewhere (by the JVM or by our DNS server), so that we'll almost always get the same address for a given host.
On the other hand, one physical server can be assigned multiple domain names, so crawling http://www.abc.com/1.html and http://www.xyz.com/1.html concurrently could mean that the same physical server is handling both requests.
In this case only a single thread will be permitted to access this server at a time.
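To make this concrete, here is a minimal, hypothetical sketch (not Nutch's actual fetcher code; the class and method names are made up) of how a per-host key can be derived from a URL by resolving the host name to an IP address. With such a scheme, two domain names served by the same machine collapse to one key, while a round-robin DNS host could in principle yield several, subject to the caching mentioned above. The example URLs are just placeholders and will not really resolve to a shared address.

import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;

// Hypothetical illustration of IP-based "host" keying; not Nutch source code.
public class HostKeyDemo {

  // Derive the per-host key for a URL by resolving its host name to an IP.
  static String hostKey(String url)
      throws MalformedURLException, UnknownHostException {
    String hostName = new URL(url).getHost();
    // The JVM and the local DNS server typically cache this lookup, so
    // repeated calls usually return the same address even for a
    // round-robin DNS host.
    return InetAddress.getByName(hostName).getHostAddress();
  }

  public static void main(String[] args) throws Exception {
    // If www.abc.com and www.xyz.com point at the same physical server,
    // both URLs map to the same key, so with fetcher.threads.per.host=1
    // only one thread may fetch from that server at a time.
    System.out.println(hostKey("http://www.abc.com/1.html"));
    System.out.println(hostKey("http://www.xyz.com/1.html"));
  }
}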
When setting the fetcher.threads.per.host value, what should I keep in mind: the DNS domain name (meaning just $1 from http://(a-zA-Z\-_0-9).*/.*) or the IP address (nslookup)?
IP address.
I don't want to make this message longer, but imagine the following situation: I start crawling with three urls in mind: url1 (contains 2 pages), url2 (50 pages), url3 (2000 pages). Now, when crawling starts with 3 threads, then after url1 is crawled one thread becomes redundant and the error rate starts growing. After url2 is crawled (potentially not fully, due to thread collisions), there are three threads left for one huge url3 only. This means that I can't get url3 crawled fully, because we are not able to avoid thread collisions, in spite of the fact that three threads were needed at the beginning.
If you set http.max.delays to a large value then you will get no errors. All the threads will be used initially, then, as hosts are exhausted, threads will block each other.
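As a rough model of what http.max.delays controls (a simplified sketch, not Nutch's fetcher; the class and method names are invented, and the fixed delay between retries stands in for the configured per-server delay): a thread that finds its target host busy waits one delay and tries again, and only after exceeding the maximum number of waits does it give up and count an error. With a very large maximum, threads simply block on busy hosts instead of failing, which is why the errors in the three-url scenario above disappear.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Simplified model of per-host politeness; not Nutch's actual fetcher.
public class PolitenessDemo {

  // One permit per host key (IP address): at most one thread fetches from
  // a given server at a time, mirroring fetcher.threads.per.host=1.
  private final Map<String, Semaphore> hosts = new ConcurrentHashMap<>();

  private final int maxDelays;      // models http.max.delays
  private final long serverDelayMs; // models the delay between retries

  PolitenessDemo(int maxDelays, long serverDelayMs) {
    this.maxDelays = maxDelays;
    this.serverDelayMs = serverDelayMs;
  }

  // Returns true if the page was fetched, false if we gave up with an error.
  boolean fetch(String hostKey) throws InterruptedException {
    Semaphore lock = hosts.computeIfAbsent(hostKey, k -> new Semaphore(1));
    for (int waited = 0; waited <= maxDelays; waited++) {
      if (lock.tryAcquire()) {
        try {
          // ... fetch one page from this server ...
          return true;
        } finally {
          lock.release();
        }
      }
      // Host is busy: wait one server delay, then try again.
      Thread.sleep(serverDelayMs);
    }
    // Exceeded the allowed number of waits: this is where the growing
    // error rate comes from when idle threads pile up on one host.
    return false;
  }
}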
Doug
