Andrzej stated in NUTCH-669 that "some people reported performance issues with Fetcher2, i.e. that it doesn't use the available bandwidth. These reports are unconfirmed, and they may have been caused by suboptimal URL / host distribution in a fetchlist - but it would be good to review the synchronization and threading aspects of Fetcher2."

To test this, I've just generated a fetchlist with generate.max.per.host = 1, which guarantees unique hosts (35,000 of them), but the problem remains.

Therefore, I believe it's clearly not an issue of suboptimal URL / host distribution. If you require any further information to confirm my report, you need only ask!

Cheers...
Roger

--------------------------------------------------
From: "Roger Dunk" <ro...@at.com.au>
Sent: Tuesday, March 17, 2009 7:10 PM
To: <nutch-user@lucene.apache.org>
Subject: Re: Fetcher2 Slow

Now that the soon-to-be-released v1 uses Fetcher2 by default (or as the only fetcher available?), might the slowness problem that a number of users are facing be addressed?

In short the case for me is like this:

Nutch trunk revision 755143
JDK 1.6_12 on Linux

Crawl list consists of ~40,000 URLs from dmoz, so they are naturally well distributed among hosts (i.e. mostly unique hosts).

Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0

The result?

Most of the time, something like this:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

If I'm lucky, it might fetch around 1 page per second (or less).
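
For anyone who wants to put numbers on this from the fetcher log, here is a minimal sketch in Python (not part of Nutch; the function name parse_status is made up for illustration). It parses a status line of the form shown above and, assuming spinWaiting counts threads that are idling while they wait for work, reports how many threads are actually busy:

```python
import re

# Matches the status line format quoted above. Any other log format is
# outside what this sketch handles.
STATUS_RE = re.compile(
    r"activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)


def parse_status(line):
    """Return (active, spin_waiting, queued, busy), or None if no match.

    busy = active - spin_waiting, on the assumption that spinWaiting
    counts threads that are alive but waiting for something to fetch.
    """
    match = STATUS_RE.search(line)
    if match is None:
        return None
    active, spinning, queued = (int(group) for group in match.groups())
    return active, spinning, queued, active - spinning


line = "activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0"
print(parse_status(line))  # (80, 79, 0, 1)
```

With the line above, only 1 of the 80 threads is doing any fetching.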

What I have noticed is that if I let it run for a while, cancel the fetch, and start it again from the beginning, it runs very quickly for a while before it slows right down to a trickle again. My guess is that the hosts that have been cached by my caching NS are fetched quickly, but new lookups are taking an age and slowing things down. However, I don't believe my NS is slow by any means. And furthermore, the old Fetcher1 never had this problem.
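
One rough way to check the DNS guess independently of Nutch is to time lookups for a sample of hosts from the fetchlist. The following is only a sketch in Python using the standard socket module; the helper names time_lookup and time_lookups are hypothetical, and the timings reflect the local resolver rather than anything Fetcher2 does internally:

```python
import socket
import time


def time_lookup(host, resolver=socket.gethostbyname):
    """Resolve one host; return (succeeded, elapsed_seconds).

    The resolver is injectable so the function can be exercised
    without touching the network.
    """
    start = time.monotonic()
    try:
        resolver(host)
        succeeded = True
    except OSError:  # socket.gaierror is a subclass of OSError
        succeeded = False
    return succeeded, time.monotonic() - start


def time_lookups(hosts, resolver=socket.gethostbyname):
    """Time a lookup for each host; return {host: (succeeded, seconds)}."""
    return {host: time_lookup(host, resolver) for host in hosts}


# Example usage (performs real DNS queries, so left commented out):
# for host, (ok, secs) in time_lookups(["example.com", "apache.org"]).items():
#     print(f"{host}: ok={ok} {secs:.3f}s")
```

Running it twice over the same hosts should show whether uncached lookups are much slower than cached ones.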

Any ideas where to look to track this down?

Thanks,
Roger

--------------------------------------------------
From: "Roger Dunk" <ro...@at.com.au>
Sent: Thursday, February 05, 2009 2:16 PM
To: <nutch-user@lucene.apache.org>
Subject: Re: Fetcher2 Slow

It makes no difference if I set fetcher.threads.per.host to 1 or 100, which I assume is what you were suggesting?

I also stated that the majority of pages to fetch were from unique hosts, so I believe the value of this parameter should not really come into play.

Cheers...
Roger

--------------------------------------------------
From: "Laurent Laborde" <kerdez...@gmail.com>
Sent: Tuesday, February 03, 2009 5:51 PM
To: <nutch-user@lucene.apache.org>
Subject: Re: Fetcher2 Slow

On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk <ro...@at.com.au> wrote:
Hi all,

I'm having no luck whatsoever using Fetcher2, as even with 50 threads enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 hosts in the queue. I do however have some 50,000 pages to fetch, the majority of which are from unique hosts.

The regular fetcher works as expected, fetching concurrently from 50 hosts.

There is a configuration parameter limiting the number of concurrent fetcher
threads per unique host.

--
F4FQM
Kerunix Flan
Laurent Laborde
