From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head, I would guess that you hit a patch of
URLs all from the same domain. That would slow down fetching,
because only one thread at a time would be active on a single
host. The generate.max.per.host config variable can limit
that.
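To make the guess above concrete, here is a minimal Python sketch of the effect. It is a toy model, not Nutch code: the function names are hypothetical, and the 5-second delay between requests to one host is an assumption used only for illustration. It shows how one over-represented host sets a floor on fetch time, and how a per-host cap (the idea behind generate.max.per.host) lowers that floor.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(urls):
    """Count how many URLs in a fetch list belong to each host."""
    return Counter(urlsplit(u).hostname for u in urls)


def min_fetch_seconds(urls, delay_seconds=5.0):
    """Lower bound on fetch time if each host is fetched serially.

    Assumes one active thread per host and a fixed delay between
    requests to the same host (delay_seconds is an assumed value).
    """
    counts = host_counts(urls)
    if not counts:
        return 0.0
    # The busiest host dominates, no matter how many threads exist.
    return max(counts.values()) * delay_seconds


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs per host, preserving order."""
    seen = Counter()
    kept = []
    for u in urls:
        host = urlsplit(u).hostname
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(u)
    return kept


# Hypothetical fetch list: ten URLs on one host, one on another.
urls = ["http://a.example/%d" % i for i in range(10)]
urls.append("http://b.example/x")

print(host_counts(urls).most_common(1))
print(min_fetch_seconds(urls))
print(min_fetch_seconds(cap_per_host(urls, 3)))
```

In this model, adding threads does not help once a single host holds most of the list, which is why capping URLs per host is the suggested fix.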
But that is just a guess. What job is it slowing down on?
Yes, Nutch will take more time with more data, but that is
too much of a difference.
Dennis
ML mail wrote:
Hello,
I am currently using the recrawl script from the Nutch
Wiki to crawl all websites from a specific small top-level
domain, and have configured the script to run with
THREADS=50, DEPTH=4, TOPN=25000, which means that each time
I run the script 100'000 pages will get crawled.
The first time I ran the script it took 6 hours for
the whole process with mergesegs, invertlinks, index, merge
and so on. The second time it took just 3 hours more, so 9
hours, then the third time 12 hours, but now the fourth time
it is actually still running after 22 hours and it's only
at page 64'000 of the crawl. It looks like it is especially
the fetch step and the index step that are running much more
slowly; the other steps look normal.
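The numbers above can be put side by side with a little arithmetic. This is a rough sketch only: it assumes the first run really did process the full DEPTH x TOPN budget, and it compares a whole-pipeline time (6 hours) with a run that is still fetching, so the true fetch-only slowdown may differ.

```python
DEPTH = 4
TOPN = 25_000

# Upper bound on pages per run: TOPN URLs generated at each depth.
pages_per_run = DEPTH * TOPN


def pages_per_hour(pages, hours):
    """Average throughput over a run."""
    return pages / hours


first_run = pages_per_hour(pages_per_run, 6)   # whole process, first run
current_run = pages_per_hour(64_000, 22)       # fourth run, still going
slowdown = first_run / current_run

print(pages_per_run)
print(round(first_run), round(current_run), round(slowdown, 1))
```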
So is this actually normal behavior for Nutch? I
would expect Nutch to get slightly slower each time due to
updating an ever-growing database/index/segment, but never
as much slower as I am currently experiencing, especially
since right now there are only 144'915 pages indexed and the
whole crawl directory with everything in it is only around
2 GB.
Nutch is running on a fairly good Pentium 4 Xeon
computer at 2.8 GHz with 1 GB RAM and not much else
running on it. I also didn't change much in the Nutch
config, so it's pretty much default.
Does anyone have an idea? I can provide more info if
needed; just let me know what you need.
Many thanks in advance and best regards