Here is an interesting article to support my theory.

http://nikitathespider.com/articles/RobotsTxt.html

It says that 1% of crawl delays are 900 seconds or greater. That is consistent with what I am seeing.

Dennis

Dennis Kubes wrote:
Just a thought going through the fetcher code. If the robots.txt specifies a delay >= the task timeout value, the task thread will sleep for that amount of time and eventually be considered a "hung thread" even though it is really just sleeping. Of course I could be reading the code wrong. It is about 2am here. I will test this concept tomorrow to see if that is actually what is happening with the hung threads.
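
To make the suspected failure mode concrete, here is a minimal sketch, not taken from the Nutch fetcher code. The class and method names are hypothetical, and the 10-minute task timeout is an assumed value chosen for illustration. It only shows the comparison described above: a crawl delay that is >= the task timeout means the thread would still be sleeping when the timeout expires, so one possible policy is to skip such hosts instead of sleeping.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: none of these names come from Nutch or Hadoop.
public class CrawlDelayCheck {

    /** True if sleeping for the crawl delay would reach or exceed the task timeout. */
    static boolean wouldLookHung(long crawlDelayMs, long taskTimeoutMs) {
        return crawlDelayMs >= taskTimeoutMs;
    }

    /**
     * Hypothetical policy: return the delay to sleep for,
     * or -1 to signal that the host should be skipped instead.
     */
    static long effectiveDelayMs(long crawlDelayMs, long taskTimeoutMs) {
        return wouldLookHung(crawlDelayMs, taskTimeoutMs) ? -1L : crawlDelayMs;
    }

    public static void main(String[] args) {
        long taskTimeoutMs = TimeUnit.MINUTES.toMillis(10); // assumed timeout
        long longDelayMs = TimeUnit.SECONDS.toMillis(900);  // a 900 second Crawl-Delay
        long shortDelayMs = TimeUnit.SECONDS.toMillis(5);   // a more typical delay

        System.out.println(wouldLookHung(longDelayMs, taskTimeoutMs));     // true
        System.out.println(effectiveDelayMs(longDelayMs, taskTimeoutMs));  // -1
        System.out.println(effectiveDelayMs(shortDelayMs, taskTimeoutMs)); // 5000
    }
}
```

Whether skipping, capping the delay, or re-queueing the URL is the right response is a separate design question. The sketch only isolates the check itself.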

Dennis

Dennis Kubes wrote:
I spoke too soon. It just took longer to hang. Still testing.

Dennis

Andrzej Bialecki wrote:
Dennis Kubes wrote:
Well, I eliminated the regular expressions, changed the http timeout value to 5000, and set the max delays to 5. I still have some tasks running slower and I am getting a few more timeout errors (which is OK for what I am doing), but it seems to have moved beyond the point at which it was failing. As soon as I get this running automatically in production, I am going to try to implement the 339 patch.

Caveat: the patch in NUTCH-339 represents work in progress; it doesn't even compile. I'm going to update it shortly to a lightly tested version that compiles.
