Andrzej stated in NUTCH-669 that "some people reported performance issues
with Fetcher2, i.e. that it doesn't use the available bandwidth. These
reports are unconfirmed, and they may have been caused by suboptimal URL /
host distribution in a fetchlist - but it would be good to review the
synchronization and threading aspects of Fetcher2."
To address this, I've just generated a fetchlist using
generate.max.per.host = 1 (which gave me 35,000 unique hosts) to guarantee
unique hosts, but the problem remains.
Therefore, I believe it's clearly not an issue of suboptimal URL / host
distribution. If you require any further information to confirm my report,
you need only ask!
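For anyone who wants to check the host distribution of their own fetchlist, here is a minimal sketch. It assumes the URLs have already been dumped to a plain list of strings; the class and method names are hypothetical and not part of Nutch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: counts how many URLs map to each host.
class HostDistribution {

    // Returns host -> number of URLs for that host.
    // URLs that are malformed or have no host component are skipped.
    static Map<String, Integer> countPerHost(List<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url.trim()).getHost();
            } catch (IllegalArgumentException e) {
                continue; // malformed URL
            }
            if (host == null) {
                continue; // no host component
            }
            counts.merge(host.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    // Largest number of URLs sharing a single host (0 for an empty map).
    static int maxPerHost(Map<String, Integer> counts) {
        int max = 0;
        for (int c : counts.values()) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) {
        // Example URLs are made up for illustration.
        List<String> urls = Arrays.asList(
                "http://a.example/page1",
                "http://b.example/",
                "http://c.example/index.html");
        Map<String, Integer> counts = countPerHost(urls);
        System.out.println("unique hosts: " + counts.size()
                + ", max URLs per host: " + maxPerHost(counts));
    }
}
```

With generate.max.per.host = 1, the maximum per host should come out as 1 and the number of unique hosts should match the number of URLs.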
Cheers...
Roger
--------------------------------------------------
From: "Roger Dunk" <ro...@at.com.au>
Sent: Tuesday, March 17, 2009 7:10 PM
To: <nutch-user@lucene.apache.org>
Subject: Re: Fetcher2 Slow
Now that the soon-to-be-released v1 uses Fetcher2 as the default (or as the
only fetcher available?), could the slowness problem that a number of users
are facing be addressed?
In short the case for me is like this:
Nutch trunk revision 755143
JDK 1.6.0_12 on Linux
The crawl list consists of ~40,000 URLs from dmoz, so they are naturally
well distributed among hosts (i.e. mostly unique hosts).
Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0
The result?
Most of the time, something like this:
activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
If I'm lucky, it might fetch around 1 page per second (or less).
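To make the status line above easier to track over a long run, here is a minimal sketch that parses it and reports how many threads are doing work. It assumes that spinWaiting counts threads idling for lack of a fetchable item, so activeThreads minus spinWaiting approximates the busy threads. The class and method names are hypothetical and not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: parses a Fetcher2-style status line.
class FetcherStatus {

    // Splits "key=value, key=value, ..." into a map; non-numeric fields are skipped.
    static Map<String, Long> parse(String line) {
        Map<String, Long> fields = new LinkedHashMap<>();
        for (String part : line.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2) {
                continue;
            }
            try {
                fields.put(kv[0].trim(), Long.parseLong(kv[1].trim()));
            } catch (NumberFormatException e) {
                // not a numeric field, ignore it
            }
        }
        return fields;
    }

    // Assumption: threads that are active but not spin-waiting are busy fetching.
    static long busyThreads(Map<String, Long> fields) {
        return fields.getOrDefault("activeThreads", 0L)
                - fields.getOrDefault("spinWaiting", 0L);
    }

    public static void main(String[] args) {
        String line = "activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0";
        Map<String, Long> fields = parse(line);
        System.out.println("busy threads: " + busyThreads(fields)
                + ", queued: " + fields.get("fetchQueues.totalSize"));
        // prints "busy threads: 1, queued: 0"
    }
}
```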
What I have noticed is that if I let it run for a while, cancel the fetch,
and start it again from the beginning, it runs very quickly for a while
before it slows right down to a trickle again. My guess is that the hosts
that have been cached by my caching NS are fetched quickly, but new lookups are
taking an age and slowing things down. However, I don't believe my NS is
slow by any means. And furthermore, the old Fetcher1 never had this
problem.
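One way to test the DNS guess independently of Nutch is to time name resolution for a sample of hosts from the fetchlist. The following is a minimal sketch using only the standard java.net.InetAddress API; the class and method names are hypothetical. Note that the JVM and the local resolver both cache lookups, so only the first lookup of each name is a meaningful measurement.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Nutch: times DNS resolution per host.
class DnsTiming {

    // Returns the elapsed time in milliseconds to resolve the host,
    // or -1 if the host cannot be resolved.
    static long resolveMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getByName(host);
        } catch (UnknownHostException e) {
            return -1;
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Pass host names on the command line; the default is an IP literal,
        // which needs no DNS lookup and only demonstrates the call.
        String[] hosts = args.length > 0 ? args : new String[] {"127.0.0.1"};
        for (String host : hosts) {
            System.out.println(host + "\t" + resolveMillis(host) + " ms");
        }
    }
}
```

If uncached names consistently take seconds rather than milliseconds, that would point at the resolver rather than at Fetcher2's queueing.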
Any ideas where to look to track this down?
Thanks,
Roger
--------------------------------------------------
From: "Roger Dunk" <ro...@at.com.au>
Sent: Thursday, February 05, 2009 2:16 PM
To: <nutch-user@lucene.apache.org>
Subject: Re: Fetcher2 Slow
It makes no difference if I set fetcher.threads.per.host to 1 or 100,
which I assume is what you were suggesting?
I also stated that the majority of pages to fetch were from unique hosts,
so I believe the value of this parameter should not really come into
play.
Cheers...
Roger
--------------------------------------------------
From: "Laurent Laborde" <kerdez...@gmail.com>
Sent: Tuesday, February 03, 2009 5:51 PM
To: <nutch-user@lucene.apache.org>
Subject: Re: Fetcher2 Slow
On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk <ro...@at.com.au> wrote:
> Hi all,
> I'm having no luck whatsoever using Fetcher2, as even with 50 threads
> enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and
> 0 hosts in the queue. I do however have some 50,000 pages to fetch, the
> majority of which are from unique hosts.
> The regular fetcher works as expected, fetching concurrently from 50
> hosts.
There is a configuration parameter limiting the number of concurrent
fetches per unique host.
--
F4FQM
Kerunix Flan
Laurent Laborde