On 21 Nov 2008, at 09:47, ML mail wrote:
What would then be the solution to this problem? Shall I simply set generate.max.per.host to something like 5? Or is there another way to make Nutch run at a good speed again?

I spent some time trying different values for generate.max.per.host and I found that this is a good rule of thumb:

generate.max.per.host = topN / numberOfNodes / 1000

Where topN is the size of your segments and numberOfNodes is the number of machines in your cluster. This keeps the fetch rate close to its maximum.
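The rule of thumb above can be sketched as a quick calculation (a minimal sketch; the function name and the clamp to a minimum of 1 are my own additions, not from the thread):

```python
def max_per_host(top_n, num_nodes):
    """Suggested generate.max.per.host for a segment of top_n URLs
    fetched across num_nodes machines, per the rule of thumb
    topN / numberOfNodes / 1000 (floored, and at least 1)."""
    return max(1, top_n // num_nodes // 1000)

# e.g. a 25'000-URL segment on a single machine:
print(max_per_host(25000, 1))  # -> 25
```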

Check the log of the fetch job -- if the last few pages consist of requests to just one or a few hosts, then your value for generate.max.per.host is too large. You want to fetch from many hosts in parallel throughout the entire fetch job. On the other hand, if you set it too low, you will never make progress on the large sites.
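One way to eyeball this from the logs: the Nutch fetcher logs a line per URL (typically containing "fetching <url>", though the exact format varies by version), so counting hosts in the tail of the log shows whether the fetch has collapsed onto a few hosts. A sketch over a fabricated sample -- in practice you would pipe in something like `tail -n 1000 logs/hadoop.log` instead of the printf:

```shell
# Fabricated sample of fetcher log lines for illustration.
printf 'fetching http://a.example/1\nfetching http://a.example/2\nfetching http://b.example/1\n' \
  | awk -F/ '/fetching/ {print $3}' \
  | sort | uniq -c | sort -rn
# Each output line is "<count> <host>"; one dominant host near the
# end of the job suggests generate.max.per.host is too large.
```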

I fetched the same segment repeatedly to find out what values work best.

Hope that helps,
Richard

Best regards

--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head I would guess that you hit a patch of
urls all from the same domain, which would slow down fetching
because only one thread would be active against that single
host. The generate.max.per.host config variable can limit
that.
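For reference, that property would typically be set in conf/nutch-site.xml; a minimal fragment (the value 100 is only an illustration -- tune it for your own crawl):

```xml
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
  <description>Maximum number of urls per host in a single
  fetchlist. -1 means no limit.</description>
</property>
```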

But that is just a guess.  What job is it slowing down on?
Yes Nutch will take more time with more data, but that is
too much of a difference.

Dennis

ML mail wrote:
Hello,

I am currently using the recrawl script from the Nutch
Wiki to crawl all websites from a specific small top-level
domain and have configured the recrawl script to run with
THREADS=50, DEPTH=4, TOPN=25000, which means that each time
I run the script 100'000 pages get crawled.
The first time I ran the script, the whole process with
mergesegs, invertlinks, index, merge and so on took 6
hours. The second time it took just 3 hours more, so 9
hours; the third time 12 hours; but now the fourth time it
is actually still running after 22 hours and it's only
at the 64'000th page to be crawled. It looks like it
is especially the fetch step and the index step that are
running much more slowly; the other steps look normal.
So is this actually normal behavior for Nutch? I
would expect Nutch to be a tiny little bit slower each
time due to updating an ever-growing
database/index/segment, but never so much slower as I am
currently experiencing. Especially when right now there are
only 144'915 pages indexed and the whole crawl directory
with everything is only around 2 GB big.
Nutch is running on a quite good Pentium 4 Xeon
computer at 2.8 GHz with 1 GB RAM and not much else
running on it; also I didn't change much in the config
of Nutch itself, so it's pretty much the default.

Does anyone have an idea ? I can provide more info if
you desire, just let me know what you need.

Many thanks in advance and best regards





