Just a thought while going through the fetcher code. If robots.txt
specifies a crawl delay >= the task timeout value, the task thread will sleep
for that amount of time and eventually be considered a "hung thread"
even though it is really just sleeping. Of course, I could be reading
the code wrong. It is about 2am here. I will test this idea
tomorrow to see if that is actually what is happening with the hung threads.
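
To make the suspected failure mode concrete, here is a rough sketch of the kind of guard that would avoid it. It is only an illustration, not the actual fetcher code: the class CrawlDelayGuard, the method decide, and the millisecond parameters are hypothetical names, and the 10-minute task timeout is an assumed value.

```java
public class CrawlDelayGuard {

    /** What a fetcher thread could do with a robots.txt crawl delay. */
    public enum Decision { SLEEP, SKIP }

    /**
     * Hypothetical check (not a Nutch API): sleeping is only safe when the
     * delay is strictly shorter than the task timeout. Otherwise the thread
     * would still be asleep when the timeout fires and look "hung".
     */
    public static Decision decide(long crawlDelayMs, long taskTimeoutMs) {
        if (crawlDelayMs < 0 || taskTimeoutMs <= 0) {
            throw new IllegalArgumentException("delay must be >= 0 and timeout > 0");
        }
        return crawlDelayMs >= taskTimeoutMs ? Decision.SKIP : Decision.SLEEP;
    }

    public static void main(String[] args) {
        long taskTimeoutMs = 600_000L; // assumed 10-minute task timeout
        System.out.println(decide(5_000L, taskTimeoutMs));   // prints SLEEP
        System.out.println(decide(900_000L, taskTimeoutMs)); // prints SKIP
    }
}
```

Skipping the URL is just one option; capping the delay at some maximum would be another way to keep the thread from sleeping past the timeout.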
Dennis
Dennis Kubes wrote:
I spoke too soon. It just took longer to hang. Still testing.
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Well, I eliminated the regular expressions and changed the timeout
value on http to 5000 and the max delays to 5. Although I still
have some tasks running slower and I am getting a few more timeout
errors (which is ok for what I am doing), it seems to have moved
beyond the point at which it was failing. As soon as I get this
running automatically in production, I am going to try to implement
the NUTCH-339 patch.
Caveat: the patch in NUTCH-339 is a work in progress; it
doesn't even compile. I'm going to update it shortly to a
lightly-tested, compilable version.