I am currently implementing a patch for the older 0.8 code that lets
pages with a crawl delay greater than x seconds be ignored, where x is
configurable. What do you think is the best way to return from
HttpBase in that case? Should it throw an HttpException, or return
a ProtocolStatus with, say, GONE or something like that?
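
To make the question concrete, here is a minimal sketch of the check itself, kept separate from how the result is reported back. The names `DelayDecision`, `CrawlDelayPolicy` and `maxCrawlDelayMs` are hypothetical and are not Nutch types or config keys; the sketch returns a value rather than throwing, which is only one of the two options above.

```java
/** Hypothetical outcome of the crawl-delay check; not a Nutch type. */
enum DelayDecision { FETCH, SKIP }

/** Hypothetical helper: decides whether a page should be skipped. */
final class CrawlDelayPolicy {
    // Configurable upper bound in milliseconds; a negative value disables the check.
    private final long maxCrawlDelayMs;

    CrawlDelayPolicy(long maxCrawlDelayMs) {
        this.maxCrawlDelayMs = maxCrawlDelayMs;
    }

    /** Returns SKIP when the robots.txt delay exceeds the configured maximum. */
    DelayDecision decide(long crawlDelayMs) {
        if (maxCrawlDelayMs >= 0 && crawlDelayMs > maxCrawlDelayMs) {
            return DelayDecision.SKIP;
        }
        return DelayDecision.FETCH;
    }
}
```

The caller would then map SKIP onto whatever HttpBase ends up returning, whether that is a status or an exception.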
Dennis
Dennis Kubes wrote:
I added some test code that hacks a 30 second delay when the delay is
greater than 30 seconds. It prints out the original delay value.
Here is the output I am seeing:
task_0005_m_000005_0 Someone is setting way to long of a delay
value...520 seconds
task_0005_m_000005_0 Someone is setting way to long of a delay
value...520 seconds
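
The test hack amounts to roughly the following. This is a sketch, not the actual test code: `DelayClamp` and `MAX_DELAY_MS` are hypothetical names, and the message text is illustrative only.

```java
/** Hypothetical helper: caps an over-long crawl delay at 30 seconds. */
final class DelayClamp {
    static final long MAX_DELAY_MS = 30_000L;

    /** Returns the delay to actually sleep for, reporting the original if it was capped. */
    static long clamp(long requestedDelayMs) {
        if (requestedDelayMs > MAX_DELAY_MS) {
            // Print the original value so over-long delays show up in the task log.
            System.err.println("crawl delay of " + (requestedDelayMs / 1000)
                    + " seconds capped at " + (MAX_DELAY_MS / 1000) + " seconds");
            return MAX_DELAY_MS;
        }
        return requestedDelayMs;
    }
}
```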
So far it has hit 4 of 5 fetcher threads on a single machine. I am
pretty sure this is what is causing the hung threads. I have a crawl
running now. I will update on its status later. It is now 3am here
so for now must sleep. :-P
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Just a thought going through the fetcher code. If the robots.txt
specifies a delay >= the task timeout value, the task thread will
sleep for that amount of time and eventually be considered a "hung
thread" even though it is really just sleeping. Of course I could
be reading the code wrong. It is about 2am here. I will test this
concept tomorrow to see if that is actually what is happening with
the hung threads.
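
The suspected condition can be written down as a one-line predicate. This is only a sketch of the reasoning above: `HungThreadCheck` is a hypothetical name, and any timeout value passed in is an assumption for illustration, not the configured one.

```java
/** Hypothetical illustration of the suspected hung-thread condition. */
final class HungThreadCheck {
    /**
     * A thread that sleeps for the whole crawl delay reports no progress
     * during that time, so a delay at or above the timeout makes it look hung.
     */
    static boolean looksHung(long crawlDelayMs, long taskTimeoutMs) {
        return crawlDelayMs >= taskTimeoutMs;
    }
}
```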
For the fetcher to die all threads would have to end up in this
state. But this sort of rings a bell - this may be an unintended
consequence of implementing Crawl-Delay support ...
NUTCH-339 now compiles and is lightly tested. Threads don't block
there; instead, they put fetchlist entries on a time-sorted queue
and continue working on other items. So this condition never occurs.
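
A rough sketch of the time-sorted queue idea follows. It is not the NUTCH-339 code: `FetchItem`, `TimeSortedFetchQueue` and `pollReady` are hypothetical names, and the sketch only shows that a worker never sleeps on a delayed entry.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical fetchlist entry with the earliest time it may be fetched. */
final class FetchItem {
    final String url;
    final long readyAtMs;

    FetchItem(String url, long readyAtMs) {
        this.url = url;
        this.readyAtMs = readyAtMs;
    }
}

/** Hypothetical queue ordered by ready time; workers never sleep on it. */
final class TimeSortedFetchQueue {
    private final PriorityQueue<FetchItem> queue =
            new PriorityQueue<>(Comparator.comparingLong((FetchItem i) -> i.readyAtMs));

    synchronized void add(FetchItem item) {
        queue.add(item);
    }

    /** Returns the earliest item if it is due at nowMs, otherwise null without blocking. */
    synchronized FetchItem pollReady(long nowMs) {
        FetchItem head = queue.peek();
        if (head == null || head.readyAtMs > nowMs) {
            return null; // nothing due yet; the caller moves on to other work
        }
        return queue.poll();
    }
}
```

A host with a 520-second delay then just sits at the back of the queue while the thread keeps fetching from other hosts.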