Just a thought while going through the fetcher code. If robots.txt specifies a crawl delay >= the task timeout value, the task thread will sleep for that long and eventually be considered a "hung thread" even though it is really just sleeping. Of course, I could be reading the code wrong; it is about 2am here. I will test this tomorrow to see whether that is actually what is happening with the hung threads.
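
To make the concern concrete, here is a minimal sketch of the comparison I have in mind. This is not the actual fetcher code: the class name, the method names, and the timeout values are all made up for illustration, and returning -1 to mean "skip this host" is just one possible guard.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Nutch code: shows why a long robots.txt delay
// can make a sleeping fetcher thread look like a hung one.
public class CrawlDelayCheck {

    /** True when sleeping for the robots.txt delay would last at least as long as the task timeout. */
    public static boolean wouldLookHung(long crawlDelayMs, long taskTimeoutMs) {
        return crawlDelayMs >= taskTimeoutMs;
    }

    /**
     * Assumed guard: returns the delay to sleep for, or -1 when the host
     * should be skipped because the sleep would outlast the task timeout.
     */
    public static long safeDelayMs(long crawlDelayMs, long taskTimeoutMs) {
        return wouldLookHung(crawlDelayMs, taskTimeoutMs) ? -1L : crawlDelayMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long taskTimeoutMs = TimeUnit.MINUTES.toMillis(10); // assumed task timeout
        long crawlDelayMs = TimeUnit.SECONDS.toMillis(900); // e.g. "Crawl-delay: 900"

        long delay = safeDelayMs(crawlDelayMs, taskTimeoutMs);
        if (delay < 0) {
            // Sleeping here would be indistinguishable from a hung thread.
            System.out.println("skipping host: crawl delay exceeds task timeout");
        } else {
            Thread.sleep(delay);
        }
    }
}
```

With these assumed numbers the delay (900 s) is longer than the timeout (600 s), so the sketch skips the host instead of sleeping.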

Dennis

Dennis Kubes wrote:
I spoke too soon.  It just took longer to hang. Still testing.

Dennis

Andrzej Bialecki wrote:
Dennis Kubes wrote:
Well, I eliminated the regular expressions and changed the http timeout value to 5000 and the max delays to 5. I still have some tasks running slower, and I am getting a few more timeout errors (which is OK for what I am doing), but it seems to have moved beyond the point at which it was failing. As soon as I get this running automatically in production, I am going to try to implement the 339 patch.

Caveat: the patch in NUTCH-339 is a work in progress; it doesn't even compile. I'm going to update it shortly to a lightly tested, compilable version.
