Here is an interesting article to support my theory.
http://nikitathespider.com/articles/RobotsTxt.html
It says that 1% of crawl delays are 900 seconds or greater. That is
consistent with what I am seeing.
Dennis
Dennis Kubes wrote:
Just a thought from going through the fetcher code. If the robots.txt
specifies a crawl delay >= the task timeout value, the task thread will
sleep for that amount of time and eventually be considered a "hung
thread" even though it is really just sleeping. Of course, I could be
reading the code wrong; it is about 2am here. I will test this
tomorrow to see if that is actually what is happening with the
hung threads.
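
To make the suspected failure mode concrete, here is a minimal sketch of the
comparison described above. It is not the actual fetcher code: the class
CrawlDelayGuard, its method names, and the 10-minute task timeout and
30-second cap used in main are all assumptions chosen for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Nutch) for reasoning about robots.txt crawl delays. */
public final class CrawlDelayGuard {

    private CrawlDelayGuard() {}

    /**
     * True when sleeping for the requested crawl delay would keep the thread
     * idle for at least the task timeout, so it could be reported as hung
     * even though it is only sleeping.
     */
    public static boolean wouldLookHung(long crawlDelayMs, long taskTimeoutMs) {
        return crawlDelayMs >= taskTimeoutMs;
    }

    /**
     * Returns the delay to sleep in milliseconds, or -1 if the delay exceeds
     * the cap and the host should be skipped instead of slept on.
     * Negative input is treated as "no delay requested".
     */
    public static long effectiveDelayMs(long crawlDelayMs, long maxCrawlDelayMs) {
        if (crawlDelayMs < 0) {
            return 0L;
        }
        return crawlDelayMs > maxCrawlDelayMs ? -1L : crawlDelayMs;
    }

    public static void main(String[] args) {
        long taskTimeoutMs = TimeUnit.MINUTES.toMillis(10);   // assumed task timeout
        long maxCrawlDelayMs = TimeUnit.SECONDS.toMillis(30); // assumed cap
        long crawlDelayMs = TimeUnit.SECONDS.toMillis(900);   // a 900-second crawl delay

        System.out.println("would look hung: " + wouldLookHung(crawlDelayMs, taskTimeoutMs));
        System.out.println("effective delay: " + effectiveDelayMs(crawlDelayMs, maxCrawlDelayMs));
    }
}
```

If the theory holds, capping the delay (or skipping the host) before the
thread sleeps would keep a very long crawl delay from tying up a task until
it times out.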
Dennis
Dennis Kubes wrote:
I spoke too soon. It just took longer to hang. Still testing.
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Well, I eliminated the regular expressions, changed the HTTP timeout
value to 5000, and set the max delays to 5. I still have some tasks
running slower and I am getting a few more timeout errors (which is
OK for what I am doing), but it seems to have moved beyond the point
at which it was failing. As soon as I get this running automatically
in production, I am going to try to implement the NUTCH-339 patch.
Caveat: the patch in NUTCH-339 is a work in progress; it doesn't
even compile. I'm going to update it shortly to a lightly tested
version that compiles.