On 21 Nov 2008, at 09:47, ML mail wrote:
What would then be the solution to this problem? Shall I simply set generate.max.per.host to something like 5? Or is there another way to make Nutch run at a good speed again?

I spent some time trying different values for generate.max.per.host and I found that this is a good rule of thumb:

generate.max.per.host = topN / numberOfNodes / 1000

Where topN is the size of your segments and numberOfNodes is the number of machines in your cluster. This keeps the fetch rate close to its maximum.
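The rule of thumb above can be sketched as a quick calculation (a minimal sketch; the function name and the clamp to a minimum of 1 are my own additions, not from the thread):

```python
def max_per_host(top_n, num_nodes):
    """Suggested generate.max.per.host for a segment of top_n URLs
    fetched across num_nodes machines, per the rule of thumb
    topN / numberOfNodes / 1000 (floored, and at least 1)."""
    return max(1, top_n // num_nodes // 1000)

# e.g. a 25'000-URL segment on a single machine:
print(max_per_host(25000, 1))  # -> 25
```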

Check the log of the fetch job -- if the last few pages consist of requests to just one or a few hosts, then your value for generate.max.per.host is too large. You want to fetch from many hosts in parallel throughout the entire fetch job. On the other hand, if you set it too low, you will never make progress on the large sites.
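One way to eyeball this from the logs: the Nutch fetcher logs a line per URL (typically containing "fetching <url>", though the exact format varies by version), so counting hosts in the tail of the log shows whether the fetch has collapsed onto a few hosts. A sketch over a fabricated sample -- in practice you would pipe in something like `tail -n 1000 logs/hadoop.log` instead of the printf:

```shell
# Fabricated sample of fetcher log lines for illustration.
printf 'fetching http://a.example/1\nfetching http://a.example/2\nfetching http://b.example/1\n' \
  | awk -F/ '/fetching/ {print $3}' \
  | sort | uniq -c | sort -rn
# Each output line is "<count> <host>"; one dominant host near the
# end of the job suggests generate.max.per.host is too large.
```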

I fetched the same segment repeatedly to find out what values work best.

Hope that helps,
Richard

Best regards

--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head I would guess that you hit a patch of
urls all from the same domain, which would slow down fetching
because only one thread would be active against that single
host. The generate.max.per.host config variable can limit
that.
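For reference, that property would typically be set in conf/nutch-site.xml; a minimal fragment (the value 100 is only an illustration -- tune it for your own crawl):

```xml
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
  <description>Maximum number of urls per host in a single
  fetchlist. -1 means no limit.</description>
</property>
```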

But that is just a guess.  What job is it slowing down on?
Yes Nutch will take more time with more data, but that is
too much of a difference.

Dennis

ML mail wrote:
Hello,

I am currently using the recrawl script from the Nutch
Wiki to crawl all websites from a specific small top-level
domain and have configured the recrawl script to run with
THREADS=50, DEPTH=4, TOPN=25000, which means that each time
I run the script 100'000 pages get crawled.
The first time I ran the script, the whole process with
mergesegs, invertlinks, index, merge and so on took 6
hours. The second time it took just 3 hours more, so 9
hours; the third time 12 hours; but now the fourth time it
is actually still running after 22 hours and it's only
at the 64'000th page to be crawled. It looks like it
is especially the fetch step and the index step that are
running much more slowly; the other steps look normal.
So is this actually normal behavior for Nutch? I
would expect Nutch to be a tiny little bit slower each
time due to updating an ever-growing
database/index/segment, but never so much slower as I am
currently experiencing. Especially when right now there are
only 144'915 pages indexed and the whole crawl directory
with everything is only around 2 GB big.
Nutch is running on a quite good Pentium 4 Xeon
computer at 2.8 GHz with 1 GB RAM and not much else
running on it; also I didn't change much in the config
of Nutch itself, so it's pretty much the default.

Does anyone have an idea ? I can provide more info if
you desire, just let me know what you need.

Many thanks in advance and best regards





