Dear Dennis,

Thank you very much for your feedback.
Your guess might be right: I checked nutch-default.xml, and generate.max.per.host is set to -1. The fetch step is currently very slow, at roughly 1 URL every 2 seconds. I also keep seeing the same websites, not just one but a small group of sites that appear again and again. The fetch has still not finished since I wrote my mail yesterday evening and has now been running for around 34 hours.

Initially I fed Nutch an extract of around 80'000 URLs from DMOZ using inject.

What would be the solution to this problem? Should I simply set generate.max.per.host to something like 5? Or is there another way to make Nutch run at a good speed again?

Best regards

--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls
> To: [email protected]
> Date: Thursday, November 20, 2008, 2:40 PM
>
> Off the top of my head I would guess that you hit a patch of
> urls all from the same domain and that would slow down
> fetching on a single host because only one thread would be
> active? The generate.max.per.host config variable can limit
> that.
>
> But that is just a guess. What job is it slowing down on?
> Yes Nutch will take more time with more data, but that is
> too much of a difference.
>
> Dennis
>
> ML mail wrote:
> > Hello,
> >
> > I am currently using the recrawl script from the Nutch
> > Wiki for crawling all websites from a specific small top
> > level domain and have configured the recrawl script to run
> > with THREADS=50, DEPTH=4, TOPN=25000. This means that each
> > time I run the script, 100'000 pages will get crawled.
> >
> > The first time I ran the script it took 6 hours for
> > the whole process with mergesegs, invertlinks, index, merge
> > and so on. The second time it took just 3 hours more, so 9
> > hours, then the third time 12 hours, but now the fourth time it
> > is still running after 22 hours and it is only
> > at page 64'000 of the crawl. It looks like it
> > is especially the fetch step and the index step which are
> > running much more slowly; the other steps look normal.
> >
> > So is this actually normal behavior for Nutch? I
> > would expect Nutch to be a tiny bit slower each time
> > due to updating an always growing
> > database/index/segment, but never as much slower as I am
> > currently experiencing. Especially when right now there are
> > only 144'915 pages indexed and the whole crawl directory
> > with everything is only around 2 GB.
> >
> > Nutch is running on a fairly good Pentium 4 Xeon
> > computer at 2.8 GHz with 1 GB RAM and not much else
> > running on it. I also didn't change much in the config
> > of Nutch itself, so it's pretty much default.
> >
> > Does anyone have an idea? I can provide more info if
> > you desire, just let me know what you need.
> >
> > Many thanks in advance and best regards
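
PS: The guess discussed in this thread (a fetch list dominated by a few hosts, each fetched by only one thread at a time) can be sanity-checked with a rough back-of-the-envelope model. The sketch below is not Nutch code and makes assumptions of its own: the function names are hypothetical, each host is assumed to be fetched strictly serially, and the 2-second spacing is simply the rate observed above, not a value read from the Nutch configuration. It estimates a lower bound on fetch time from the largest per-host URL count and shows how a per-host cap, in the spirit of generate.max.per.host, changes that bound.

```python
from collections import Counter
from urllib.parse import urlparse


def host_counts(urls):
    """Count how many URLs in a fetch list belong to each host."""
    return Counter(urlparse(u).netloc.lower() for u in urls)


def min_fetch_seconds(urls, delay_per_host=2.0):
    """Lower bound on fetch time when each host is fetched serially.

    Assumption: one request at a time per host, spaced
    `delay_per_host` seconds apart; different hosts run in parallel.
    The busiest host therefore determines the minimum duration.
    """
    counts = host_counts(urls)
    if not counts:
        return 0.0
    return max(counts.values()) * delay_per_host


def cap_per_host(urls, max_per_host):
    """Keep at most `max_per_host` URLs per host, preserving order."""
    seen = Counter()
    kept = []
    for u in urls:
        host = urlparse(u).netloc.lower()
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(u)
    return kept


if __name__ == "__main__":
    # Hypothetical fetch list of 25'000 URLs where one host dominates.
    urls = [f"http://big.example/page{i}" for i in range(20000)]
    urls += [f"http://site{i}.example/" for i in range(5000)]

    print("uncapped, hours:", min_fetch_seconds(urls) / 3600)
    print("capped at 5, seconds:", min_fetch_seconds(cap_per_host(urls, 5)))
```

In this model the thread count does not matter once one host holds most of the list: extra threads sit idle while the busiest host is worked through one URL at a time. A cap trades that for smaller fetch lists per round, so the skipped URLs have to be picked up in later generate/fetch cycles.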
