Dear Dennis,

Thank you very much for your feedback. 

Your guess might be right: I checked nutch-default.xml and 
generate.max.per.host is set to -1. The fetch step is currently very slow, 
at roughly 1 URL every 2 seconds. I also keep seeing the same websites, not 
just one but a small collection of sites that appear again and again. The 
fetch has in fact still not finished since I wrote my mail yesterday evening, 
and has now been running for around 34 hours.

Initially I fed Nutch around 80'000 URLs extracted from DMOZ, using 
inject.

What would be the solution to this problem? Shall I simply set 
generate.max.per.host to something like 5? Or is there another way to make 
Nutch run at a good speed again?
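
In case it helps, this is roughly the override I would try. It is only a 
sketch: the value 5 is just the guess from my question, not something I have 
tested, and I am assuming the override belongs in conf/nutch-site.xml rather 
than in nutch-default.xml itself.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: properties here override nutch-default.xml -->
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <!-- Default is -1 (unlimited). 5 is an untested example value. -->
    <value>5</value>
  </property>
</configuration>
```

If I understand the property correctly, this would cap how many URLs from a 
single host end up in one generated fetchlist, so that a few large sites 
cannot dominate a segment.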

Best regards


--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls
> To: [email protected]
> Date: Thursday, November 20, 2008, 2:40 PM
> Off the top of my head I would guess that you hit a patch of
> urls all from the same domain and that would slow down
> fetching on a single host because only one thread would be
> active?  The generate.max.per.host config variable can limit
> that.
> 
> But that is just a guess.  What job is it slowing down on? 
> Yes Nutch will take more time with more data, but that is
> too much of a difference.
> 
> Dennis
> 
> ML mail wrote:
> > Hello,
> > 
> > I am currently using the recrawl script from the Nutch Wiki to crawl
> > all websites from a specific small top-level domain, and have
> > configured the script to run with THREADS=50, DEPTH=4, TOPN=25000.
> > This means that each time I run the script, 100'000 pages
> > (4 x 25'000) will get crawled.
> > The first time I ran the script it took 6 hours for the whole
> > process with mergesegs, invertlinks, index, merge and so on. The
> > second time it took just 3 hours more, so 9 hours, then the third
> > time 12 hours, but now the fourth time it is still running after
> > 22 hours and is only at around page 64'000. It is mainly the fetch
> > step and the index step that are running much more slowly; the
> > other steps look normal.
> > So is this normal behavior for Nutch? I would expect Nutch to be a
> > little slower each time because it updates an ever-growing
> > database/index/segment, but never as much slower as I am currently
> > experiencing, especially when there are only 144'915 pages indexed
> > right now and the whole crawl directory with everything is only
> > around 2 GB.
> > Nutch is running on a fairly good Pentium 4 Xeon 2.8 GHz computer
> > with 1 GB RAM and not much else running on it. I also didn't
> > change much in the Nutch config, so it's pretty much default.
> > 
> > Does anyone have an idea? I can provide more info if you desire,
> > just let me know what you need.
> > 
> > Many thanks in advance and best regards 
> > 
> >


      
