Hi,

This is just an observation, and a warning for those of you who are crawling single sites in depth and have encountered the frequent "Exceeded http.max.delays" exception.

Assume the following scenario: a user runs the CrawlTool to crawl a single site. Fetchlists generated by the CrawlTool will contain only URLs from that site (which is to say, from the same IP). The logic in Http.blockAddr() then interacts badly with the Fetcher and effectively deadlocks the FetcherThreads: by default the Fetcher starts 10 threads, and each of them tries to access the same IP address, but the default value of fetcher.threads.per.host is just 1. This means that only the first thread is allowed to run; the other 9 threads spin, waiting for the first thread to finish. Eventually, some of these waiting threads exceed the maximum wait time and throw the above exception.
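
To make the interaction concrete, here is a simplified, self-contained sketch of the behaviour as I understand it. This is not the actual Http.blockAddr() code; the Semaphore-based model and the constant values are mine, chosen only to mirror the defaults mentioned above:

import java.util.concurrent.Semaphore;

public class BlockAddrSketch {
    static final int MAX_THREADS_PER_HOST = 1; // fetcher.threads.per.host default
    static final int HTTP_MAX_DELAYS = 3;      // http.max.delays (illustrative value)
    static final long DELAY_MS = 1000;         // wait between retries (illustrative)

    // One permit per concurrent fetch allowed against the same host/IP.
    static final Semaphore hostSlots = new Semaphore(MAX_THREADS_PER_HOST);

    static void fetch(String url) throws Exception {
        int delays = 0;
        // Spin until a slot for this host frees up, or give up.
        while (!hostSlots.tryAcquire()) {
            if (++delays > HTTP_MAX_DELAYS) {
                throw new Exception("Exceeded http.max.delays: retry later");
            }
            Thread.sleep(DELAY_MS);
        }
        try {
            // ... the actual HTTP fetch would happen here ...
            Thread.sleep(5000); // pretend the page takes a while to download
        } finally {
            hostSlots.release();
        }
    }

    public static void main(String[] args) {
        // 10 fetcher threads, all hitting the same host: only one proceeds,
        // the rest wait and eventually fail once they exceed HTTP_MAX_DELAYS.
        for (int i = 0; i < 10; i++) {
            new Thread(() -> {
                try {
                    fetch("http://example.com/page");
                    System.out.println("fetched OK");
                } catch (Exception e) {
                    System.out.println(e.getMessage());
                }
            }).start();
        }
    }
}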

I don't immediately see what the best solution to this problem would be. One option would be for the CrawlTool to automatically adjust the "fetcher.threads.per.host" value so that it follows the formula "fetcher.threads / number of URLs in the urllist". I did this manually, in the config file and on the command line to the CrawlTool, and you can do the same to avoid this problem for now. A sketch of such an override is shown below.
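
For reference, a minimal override along these lines can go into the site-specific config file (assuming the standard nutch-site.xml override mechanism; the value 10 is not canonical, it simply matches the default number of fetcher threads mentioned above):

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>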

Any comments?

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


