Re: Fetch jumps to 1.0 complete

Dennis Kubes Sat, 05 Aug 2006 00:11:52 -0700

Here is an interesting article to support my theory.

http://nikitathespider.com/articles/RobotsTxt.html

It says that 1% of crawl delays are 900 seconds or greater. That is consistent with what I am seeing.


Dennis

Dennis Kubes wrote:

Just a thought going through the fetcher code. If the robots.txt specifies a delay >= the task timeout value, the task thread will sleep for that amount of time and eventually be considered a "hung thread" even though it is really just sleeping. Of course I could be reading the code wrong. It is about 2am here. I will test this concept tomorrow to see if that is actually what is happening with the hung threads.
Dennis
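
A minimal sketch of the condition described above, not the actual Nutch fetcher code. The class name, method names, and the timeout value are hypothetical and chosen only for illustration. It checks whether a robots.txt crawl delay would outlast the task timeout, and shows one possible mitigation (capping the delay).

```java
public class CrawlDelayCheck {

    // Hypothetical helper: true when a thread sleeping for the crawl delay
    // would still be asleep when the task timeout expires, so a watchdog
    // would report it as hung even though it is only sleeping.
    static boolean wouldLookHung(long crawlDelayMs, long taskTimeoutMs) {
        return crawlDelayMs >= taskTimeoutMs;
    }

    // Hypothetical mitigation: never sleep longer than a configured cap.
    static long cappedDelay(long crawlDelayMs, long maxDelayMs) {
        return Math.min(crawlDelayMs, maxDelayMs);
    }

    public static void main(String[] args) {
        long taskTimeoutMs = 600_000L; // assumed 10-minute task timeout (illustrative)
        long crawlDelayMs = 900_000L;  // a 900-second Crawl-delay from robots.txt (illustrative)

        System.out.println("looks hung: " + wouldLookHung(crawlDelayMs, taskTimeoutMs));
        System.out.println("capped delay ms: " + cappedDelay(crawlDelayMs, 5_000L));
    }
}
```

Whether capping the delay or skipping the host is the better choice depends on how strictly the crawler should honor robots.txt; the sketch only illustrates the comparison.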

Dennis Kubes wrote:
I spoke too soon.  It just took longer to hang. Still testing.

Dennis

Andrzej Bialecki wrote:
Dennis Kubes wrote:
Well, I eliminated the regular expressions and changed the timeout value on http to 5000 and the max delays to 5, and although I still have some tasks running slower and I am getting a few more timeout errors (which is ok for what I am doing), it seems to have moved beyond the point at which it was failing. As soon as I get this running automatically in production I am going to try to implement the 339 patch.
Caveat: the patch in NUTCH-339 represents work-in-progress, it doesn't even compile. I'm going to update it shortly to a lightly-tested compile-able version.
