Dear Richard and others interested,

Just wanted to post the results of reducing generate.max.per.host to 25 
(instead of -1: unlimited) as recommended by Richard. 

To summarize: the fetch step now takes 1 hour instead of 6 hours (for 
topN set to 25000), but unfortunately the generate step is still quite 
slow and takes around 4 hours (for the same topN). 

Is it normal for the generate step to still be so slow? The whole index 
contains only around 170'000 pages. Is there perhaps an option in the 
nutch-default.xml config file to optimize the generate step?

Best regards




--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:

> From: Richard Cyganiak <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls
> To: [email protected]
> Date: Friday, November 21, 2008, 2:42 AM
> On 21 Nov 2008, at 09:47, ML mail wrote:
> > What would then be the solution to this problem? Shall I simply set
> > generate.max.per.host to something like 5? Or is there another way
> > to make Nutch run at a good speed again?
> 
> I spent some time trying different values for
> generate.max.per.host and I found that this is a good rule
> of thumb:
> 
> generate.max.per.host = topN / numberOfNodes / 1000
> 
> Where topN is the size of your segments and numberOfNodes is
> the number of machines in your cluster. This keeps the fetch
> rate close to maximum.
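
For reference, the rule of thumb above can be written as a small Python
helper. This is only a sketch: the function name max_per_host and the
clamp to a minimum of 1 are my own assumptions, not part of Nutch. With
topN = 25000 on a single machine it gives 25, the value used in the
crawl reported at the top of this message.

```python
def max_per_host(top_n: int, number_of_nodes: int) -> int:
    """Rule of thumb from this thread: topN / numberOfNodes / 1000."""
    if top_n <= 0 or number_of_nodes <= 0:
        raise ValueError("top_n and number_of_nodes must be positive")
    # Integer division. The clamp to 1 is an assumption of this sketch,
    # so that small segments still get a usable positive limit.
    return max(1, top_n // (number_of_nodes * 1000))


if __name__ == "__main__":
    # Single machine, topN = 25000, as in this thread.
    print(max_per_host(25000, 1))
```

The result would then be set as generate.max.per.host in the Nutch
configuration.
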
> 
> Check the log of the fetch job: if the last few pages
> consist of requests to just one or a few hosts, then your
> value for generate.max.per.host is too large. You want to
> fetch from many hosts in parallel throughout the entire
> fetch job. On the other hand, if you set it too low, then
> you will never make progress on these large sites.
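
One way to make that check less manual is to count which hosts appear in
the last few hundred fetch entries of the log. The sketch below assumes
that each fetch shows up as a log line containing "fetching " followed
by the URL; that marker, the default tail of 200 entries, and the name
tail_host_counts are assumptions to adapt to your own log format.

```python
from collections import Counter
from urllib.parse import urlsplit


def tail_host_counts(log_lines, tail=200, marker="fetching "):
    """Count hosts among the last `tail` fetched URLs found in log_lines."""
    urls = []
    for line in log_lines:
        # Assumed log format: "<prefix> fetching <url>"
        _, found, rest = line.partition(marker)
        parts = rest.split()
        if found and parts:
            urls.append(parts[0])
    last_urls = urls[-tail:] if tail > 0 else []
    hosts = [urlsplit(u).hostname or "(unknown)" for u in last_urls]
    return Counter(hosts)


if __name__ == "__main__":
    sample = [
        "INFO fetcher.Fetcher - fetching http://a.example/1",
        "INFO fetcher.Fetcher - fetching http://b.example/x",
        "INFO fetcher.Fetcher - fetching http://a.example/2",
    ]
    for host, count in tail_host_counts(sample).most_common():
        print(host, count)
```

If one or two hosts account for nearly all of the entries at the end of
the log, that matches the "value too large" case described above.
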
> 
> I fetched the same segment repeatedly to find out what
> values work best.
> 
> Hope that helps,
> Richard
> 
> 
> > 
> > 
> > Best regards
> > 
> > 
> > --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> > 
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >> To: [email protected]
> >> Date: Thursday, November 20, 2008, 2:40 PM
> >> Off the top of my head I would guess that you hit a patch of
> >> URLs all from the same domain and that would slow down
> >> fetching on a single host because only one thread would be
> >> active?  The generate.max.per.host config variable can limit
> >> that.
> >> 
> >> But that is just a guess.  What job is it slowing down on?
> >> Yes, Nutch will take more time with more data, but that is
> >> too much of a difference.
> >> 
> >> Dennis
> >> 
> >> ML mail wrote:
> >>> Hello,
> >>> 
> >>> I am currently using the recrawl script from the Nutch Wiki for
> >>> crawling all websites from a specific small top level domain and
> >>> have configured the recrawl script to run with THREADS=50,
> >>> DEPTH=4, TOPN=25000. This means that each time I run the script,
> >>> 100'000 pages (4 x 25000) will get crawled.
> >>> The first time I ran the script it took 6 hours for the whole
> >>> process with mergesegs, invertlinks, index, merge and so on. The
> >>> second time it took just 3 hours more, so 9 hours, then the third
> >>> time 12 hours, but now the fourth time it is actually still
> >>> running after 22 hours and it is only at page 64'000 of the
> >>> crawl. It looks like it is especially the fetch step and the
> >>> index step which are running much more slowly; the other steps
> >>> look normal.
> >>> So is this actually normal behavior for Nutch? I would expect
> >>> Nutch to be a tiny bit slower each time due to updating an
> >>> ever-growing database/index/segment, but never as much slower as
> >>> I am currently experiencing. Especially when right now there are
> >>> only 144'915 pages indexed and the whole crawl directory with
> >>> everything is only around 2 GB big.
> >>> Nutch is running on a quite good Pentium 4 Xeon computer, 2.8 GHz
> >>> with 1 GB RAM, and nothing much else running on it. Also, I
> >>> didn't change much in the config of Nutch itself, so it's pretty
> >>> much default.
> >>> 
> >>> Does anyone have an idea? I can provide more info if you desire,
> >>> just let me know what you need.
> >>> 
> >>> Many thanks in advance and best regards
> >>> 
> >>> 
> > 
> > 
> >


      
