Dennis Kubes wrote:
Just a thought going through the fetcher code. If the robots.txt specifies a delay >= the task timeout value, the task thread will sleep for that amount of time and eventually be considered a "hung thread" even though it is really just sleeping. Of course I could be reading the code wrong. It is about 2am here. I will test this concept tomorrow to see if that is actually what is happening with the hung threads.

For the fetcher to die all threads would have to end up in this state. But this sort of rings a bell - this may be an unintended consequence of implementing Crawl-Delay support ...

NUTCH-339 now compiles and is lightly tested. Threads don't block there; instead they put fetchlist entries on a time-sorted queue and continue working on other items, so this condition never occurs.
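
To illustrate the idea of a time-sorted queue, here is a rough sketch. It is not the NUTCH-339 patch itself; the class and method names (TimeSortedFetchQueue, add, pollReady) are made up for this example, and the caller passes in the current time so the behaviour is easy to check. A worker asks for the next entry that is ready and, if there is none, moves on instead of sleeping for the Crawl-Delay.

```java
import java.util.PriorityQueue;

// Hypothetical sketch, not the NUTCH-339 code: entries wait on a queue ordered
// by their earliest allowed fetch time, so no worker thread has to sleep.
class TimeSortedFetchQueue {

    // One fetchlist entry plus the earliest time (in ms) it may be fetched.
    static final class Entry implements Comparable<Entry> {
        final String url;
        final long readyAtMillis;

        Entry(String url, long readyAtMillis) {
            this.url = url;
            this.readyAtMillis = readyAtMillis;
        }

        @Override
        public int compareTo(Entry other) {
            return Long.compare(readyAtMillis, other.readyAtMillis);
        }
    }

    // Head of the queue is always the entry that becomes ready first.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>();

    synchronized void add(String url, long readyAtMillis) {
        queue.add(new Entry(url, readyAtMillis));
    }

    // Returns the next URL whose delay has elapsed, or null if nothing is
    // ready yet. It never blocks, so the caller can work on other items.
    synchronized String pollReady(long nowMillis) {
        Entry head = queue.peek();
        if (head == null || head.readyAtMillis > nowMillis) {
            return null;
        }
        return queue.poll().url;
    }

    synchronized int size() {
        return queue.size();
    }
}
```

With this shape, an entry for a host with a 60 second Crawl-Delay simply stays on the queue while entries for other hosts are handed out, so a long delay cannot make a thread look hung.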

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

