On 5/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Doğacan Güney wrote:
> On 5/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> Doğacan Güney wrote:
>> > Hi everyone,
>> >
>> > Has anyone tried Fetcher2 from latest trunk? In our tests, Fetcher2 is
>> > always slower (by a large margin) than Fetcher.
>> >
>> > For a segment with ~30000 urls, we ran Fetcher with 150 threads and
>> > Fetcher2 with 50 threads. Fetcher finishes in around 1 hour, while
>> > Fetcher2 takes around 4 hours. We ran this test more than once and
>> > got similar results.
>> >
>> > Are we running Fetcher2 with too few/too many threads? I was under the
>> > impression that Fetcher2 doesn't need as many threads as Fetcher since
>> > threads do not block.
>>
>>
>> Yes, that was the idea. Could you test it with the same number of
>> threads? Is the configuration identical in all other aspects?
>
> Yes, it is identical in other aspects. I am currently testing with
> same number of threads. Will report if there is a difference.
>
>>
>> Are you running the version with the fix from NUTCH-474?
>>
>>
>> >
>> > Any suggestions?
>> >
>>
>> If you already have a setup to reproduce this, you could perhaps spend
>> some time debugging this ... add some timing info, and queue info
>> logging.
>
> What do you think would be a good place (or places) to add debug info?
> Looking at the code, I am not sure where to add it.

FetchItemQueues.getFetchItem() and FetchItemQueue.getFetchItem() would
be good places to start - the logging here would show how frequently
they are called, and why fetch items are not picked up (perhaps
per-queue blocking is buggy?).
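
As a rough illustration of the kind of logging suggested above, here is a
minimal Java sketch of a per-host queue that counts how often it is polled
and why it returns nothing. InstrumentedQueue and all of its fields are
hypothetical names made up for this example; this is not the Fetcher2
code, and the real queue classes may be structured differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a per-host fetch queue (NOT the Fetcher2 class).
class InstrumentedQueue {
    private final Deque<String> items = new ArrayDeque<>();
    private final long minDelayMs;     // assumed per-host politeness delay
    private long nextFetchTime = 0L;   // earliest time the next item may be handed out

    // Counters that a periodic log statement could report.
    final AtomicLong calls = new AtomicLong();
    final AtomicLong emptyMisses = new AtomicLong();
    final AtomicLong blockedMisses = new AtomicLong();

    InstrumentedQueue(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    synchronized void add(String url) {
        items.add(url);
    }

    // Returns the next URL, or null if the queue is empty or still blocked.
    // The caller passes the current time so the behaviour is easy to test.
    synchronized String getFetchItem(long now) {
        calls.incrementAndGet();
        if (items.isEmpty()) {
            emptyMisses.incrementAndGet();
            return null;
        }
        if (now < nextFetchTime) {
            blockedMisses.incrementAndGet();
            return null;
        }
        nextFetchTime = now + minDelayMs;
        return items.poll();
    }

    String stats() {
        return "calls=" + calls + " empty=" + emptyMisses + " blocked=" + blockedMisses;
    }
}
```

Logging stats() every few seconds would show whether threads mostly find
empty queues or mostly find queues that are still blocked by the delay.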

I am still not sure about the source of this bug, but I think I found
some unnecessary waits in Fetcher2. Even if a url is blocked by
robots.txt (or has a crawl delay larger than max.crawl.delay),
Fetcher2 still waits fetcher.server.delay before fetching another url
from the same host. This wait is not necessary, since Fetcher2 never
made a request to the server in that case.

So, I have put up a patch for this at (*). What do you think? If you
have no objections, I am going to go ahead and open an issue for this.

(*) http://www.ceng.metu.edu.tr/~e1345172/fetcher2_robots.patch
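
To make the idea concrete (this is not the patch itself, only a minimal
sketch under my own assumptions), the per-host delay can be applied only
when a request actually went out to the server. HostQueue, finishItem and
the requestMade flag are hypothetical names used for illustration; they
are not taken from Fetcher2.

```java
// Hypothetical model of a per-host queue; names are illustrative only.
class HostQueue {
    private final long serverDelayMs;  // plays the role of fetcher.server.delay
    private long nextFetchTime = 0L;   // earliest time the host may be contacted again

    HostQueue(long serverDelayMs) {
        this.serverDelayMs = serverDelayMs;
    }

    // True if another url from this host may be fetched at time 'now'.
    boolean isReady(long now) {
        return now >= nextFetchTime;
    }

    // Called when a url has been dealt with. requestMade is false when the
    // url was rejected locally (for example denied by robots.txt, or its
    // crawl delay was too large), so the server was never contacted.
    void finishItem(long now, boolean requestMade) {
        if (requestMade) {
            nextFetchTime = now + serverDelayMs;
        }
        // No request was sent: leave nextFetchTime alone so the next url
        // from this host is not delayed for no reason.
    }
}
```

With a 5000 ms delay, finishItem(100, false) leaves the host ready
immediately, while finishItem(200, true) blocks it until time 5200.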





--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




--
Doğacan Güney
