Well, generate will still have to go through all of the URLs, although skipping after 25 per domain should be quick. It really depends on your hardware and on any regexes you may be running in the URL filters during the generate step. A 2.8 GHz Xeon with 1 GB of RAM should be fairly quick; if I remember correctly, generating 100,000 pages on the Core 2 Duo in my laptop (4 GB RAM) takes less than an hour. What type (and speed) of hard drive are you using, and are you swapping a lot during generate? It may be the amount of RAM.

Dennis

ML mail wrote:
Dear Richard and others interested,

Just wanted to post the results of reducing generate.max.per.host to 25 (instead of -1: unlimited) as recommended by Richard. To summarize: the fetch step has been greatly reduced to 1 hour instead of 6 hours (for topN set at 25000), but unfortunately the generate step is still quite slow and takes around 4 hours (for the same topN).
Is it normal for the generate step to still be so slow? The whole index is
only around 170'000 pages. Is there maybe also an option in the
nutch-default.xml config file where one can optimize the generate process?
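(Editorial aside: overrides conventionally go in conf/nutch-site.xml rather than editing nutch-default.xml directly. A minimal sketch of the override discussed in this thread; the value 25 is simply the one Richard recommended:)

```xml
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<property>
  <name>generate.max.per.host</name>
  <value>25</value>
  <description>Maximum number of URLs per host in a single
  fetchlist; -1 means unlimited.</description>
</property>
```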

Best regards




--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:

From: Richard Cyganiak <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Friday, November 21, 2008, 2:42 AM
On 21 Nov 2008, at 09:47, ML mail wrote:
What would then be the solution to this problem?
Shall I simply set generate.max.per.host to something like 5?
Or is there another way to make Nutch run at a good speed
again?

I spent some time trying different values for
generate.max.per.host and I found that this is a good rule
of thumb:

generate.max.per.host = topN / numberOfNodes / 1000

Where topN is the size of your segments and numberOfNodes is
the number of machines in your cluster. This keeps the fetch
rate close to maximum.
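(Editorial aside: the rule of thumb works out like this; a quick sketch, where the function name is invented for illustration:)

```python
def max_per_host(top_n, number_of_nodes):
    """Richard's rule of thumb:
    generate.max.per.host = topN / numberOfNodes / 1000,
    floored to at least 1."""
    return max(1, top_n // (number_of_nodes * 1000))

# Single-machine crawl with topN = 25000, as in this thread:
print(max_per_host(25000, 1))   # -> 25

# A hypothetical 4-node cluster generating 100000-page segments:
print(max_per_host(100000, 4))  # -> 25
```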

Check the log of the fetch job -- if the last few pages
consist of requests to just one or a few hosts, then your
value for generate.max.per.host is too large. You want to
fetch from many hosts in parallel throughout the entire
fetch job. On the other hand, if you set it too low, then
you will never make progress on the large sites.
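(Editorial aside: one quick way to eyeball this from the logs. This is a sketch that assumes fetcher lines of the form "fetching <url>"; the exact log file location and line format may differ between Nutch/Hadoop versions:)

```shell
# Count distinct hosts among the most recent fetches; if one host
# dominates at the tail of the job, generate.max.per.host is too high.
tail -n 500 logs/hadoop.log \
  | grep 'fetching' \
  | awk -F/ '{print $3}' \
  | sort | uniq -c | sort -rn | head
```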

I fetched the same segment repeatedly to find out what
values work best.

Hope that helps,
Richard



Best regards


--- On Thu, 11/20/08, Dennis Kubes
<[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow
after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head I would guess that you hit a patch of
URLs all from the same domain, which would slow down fetching
because only one thread would be active against that single
host. The generate.max.per.host config variable can limit that.

But that is just a guess. What job is it slowing down on?
Yes, Nutch will take more time with more data, but that is
too much of a difference.

Dennis

ML mail wrote:
Hello,

I am currently using the recrawl script from the Nutch wiki
to crawl all websites from a specific small top-level domain,
and have configured the script to run with THREADS=50,
DEPTH=4, TOPN=25000, which means that each time I run the
script 100'000 pages get crawled.

The first time I ran the script it took 6 hours for the
whole process with mergesegs, invertlinks, index, merge and
so on. The second time it took 3 hours more, so 9 hours, and
the third time 12 hours, but now the fourth time it is
actually still running after 22 hours and is only at the
64'000th page to be crawled. It looks like it is especially
the fetch step and the index step which are running much
more slowly; the other steps look normal.

So is this actually normal behavior for Nutch? I would
expect Nutch to get a tiny little bit slower each time due
to updating an always-growing database/index/segment, but
never so much slower as I am currently experiencing,
especially when right now there are only 144'915 pages
indexed and the whole crawl directory with everything is
only around 2 GB.

Nutch is running on a quite good Pentium 4 Xeon computer,
2.8 GHz with 1 GB RAM, and nothing else much running on it;
also I didn't change much in the config of Nutch itself, so
it's pretty much default.

Does anyone have an idea? I can provide more info if you
desire, just let me know what you need.

Many thanks in advance and best regards





