Re: Fetch jumps to 1.0 complete

Dennis Kubes Wed, 09 Aug 2006 08:14:41 -0700

I am currently implementing a patch for the older 0.8 code that allows pages with crawl delay > x seconds to be ignored where the number of seconds is configurable. What do you think the best way to return from the HttpBase would be? Would it be to throw an HttpException or return a ProtocolStatus with say GONE or something like that?
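A minimal sketch of the check described above, written as plain Java rather than against the real Nutch classes. The names `CrawlDelayPolicy`, `Decision`, and `maxCrawlDelayMs` are hypothetical and not part of Nutch. How the caller reports a skipped page (an exception or a status code) is left open, since that is the question being asked.

```java
// Hypothetical sketch: none of these names come from Nutch itself.
final class CrawlDelayPolicy {
    enum Decision { FETCH, SKIP }

    // Assumed configurable limit in milliseconds; a negative value disables the check.
    private final long maxCrawlDelayMs;

    CrawlDelayPolicy(long maxCrawlDelayMs) {
        this.maxCrawlDelayMs = maxCrawlDelayMs;
    }

    // Decide whether a page should be fetched given the Crawl-Delay from robots.txt.
    Decision decide(long crawlDelayMs) {
        if (maxCrawlDelayMs >= 0 && crawlDelayMs > maxCrawlDelayMs) {
            return Decision.SKIP; // delay exceeds the limit, so ignore the page
        }
        return Decision.FETCH;
    }
}
```

With a 30 second limit, the 520 second delay seen in the logs would yield `SKIP`, and the caller could then translate that into whichever return mechanism is chosen.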


Dennis

Dennis Kubes wrote:
I added some test code that hacks a 30 second delay when the delay is greater than 30 seconds. It prints out the original delay value. Here is the output I am seeing:
task_0005_m_000005_0 Someone is setting way to long of a delay value...520 seconds
task_0005_m_000005_0 Someone is setting way to long of a delay value...520 seconds
So far it has hit 4 of 5 fetcher threads on a single machine. I am pretty sure this is what is causing the hung threads. I have a crawl running now. I will update on its status later. It is now 3am here so for now must sleep. :-P
Dennis

Andrzej Bialecki wrote:
Dennis Kubes wrote:
Just a thought going through the fetcher code. If the robots.txt specifies a delay >= the task timeout value, the task thread will sleep for that amount of time and eventually be considered a "hung thread" even though it is really just sleeping. Of course I could be reading the code wrong. It is about 2am here. I will test this concept tomorrow to see if that is actually what is happening with the hung threads.
For the fetcher to die all threads would have to end up in this state. But this sort of rings a bell - this may be an unintended consequence of implementing Crawl-Delay support ...
NUTCH-339 now compiles and is lightly tested. Threads don't block there, instead they put fetchlist entries on a time-sorted queue, and continue working on other items. So, this condition never occurs.
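The time-sorted queue idea can be illustrated with the standard `java.util.concurrent.DelayQueue`. This is only a sketch of the general technique and is not the NUTCH-339 code. The class names `DelayedFetchItem` and `FetchQueueSketch` are hypothetical.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical fetchlist entry that becomes available once its crawl delay has elapsed.
final class DelayedFetchItem implements Delayed {
    final String url;
    private final long readyAtNanos;

    DelayedFetchItem(String url, long delayMs) {
        this.url = url;
        this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // Remaining time until the item is ready; zero or negative means ready now.
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        // Items with the shortest remaining delay sort first.
        return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                            other.getDelay(TimeUnit.NANOSECONDS));
    }
}

// Hypothetical wrapper: threads defer items instead of sleeping on them.
final class FetchQueueSketch {
    private final DelayQueue<DelayedFetchItem> queue = new DelayQueue<>();

    void defer(String url, long crawlDelayMs) {
        queue.put(new DelayedFetchItem(url, crawlDelayMs));
    }

    // Returns the URL of an item whose delay has elapsed, or null if none is ready.
    // Never blocks, so the calling thread can move on to other work.
    String pollReady() {
        DelayedFetchItem item = queue.poll();
        return item == null ? null : item.url;
    }
}
```

Because `pollReady()` returns immediately, a host with a very long Crawl-Delay only postpones its own entries and does not tie up a fetcher thread.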
