Hello Richard,

What you just explained looks exactly like the problem I am currently experiencing. So, as you advised, and because we have just 1 node with topN set to 25000 pages, I have set generate.max.per.host to 25. I will have the crawler run tonight to test this new setting and will keep you posted about the results, hopefully tomorrow, if the crawl got faster.
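[For anyone following along: the property goes into conf/nutch-site.xml. A minimal sketch of what that override might look like; the comment text is my own wording, not from nutch-default.xml:]

```xml
<property>
  <name>generate.max.per.host</name>
  <!-- Cap on URLs generated per host per segment. 25 here follows the
       topN / numberOfNodes / 1000 rule of thumb from Richard's reply,
       with topN=25000 and a single node. -->
  <value>25</value>
</property>
```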
Thank you for your great help, I am pretty sure that will fix it, at least I hope so very much ;-)

Regards

--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:

> From: Richard Cyganiak <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls
> To: [email protected]
> Date: Friday, November 21, 2008, 2:42 AM
>
> On 21 Nov 2008, at 09:47, ML mail wrote:
> > What would then be the solution to this problem? Shall I simply set
> > generate.max.per.host to something like 5? Or is there another way to
> > make Nutch run at a good speed again?
>
> I spent some time trying different values for generate.max.per.host and
> I found this to be a good rule of thumb:
>
> generate.max.per.host = topN / numberOfNodes / 1000
>
> where topN is the size of your segments and numberOfNodes is the number
> of machines in your cluster. This keeps the fetch rate close to maximum.
>
> Check the log of the fetch job -- if the last few pages consist of
> requests to just one or a few hosts, then your value for
> generate.max.per.host is too large. You want to fetch from many hosts
> in parallel throughout the entire fetch job. On the other hand, if you
> set it too low, then you will never make progress on these large sites.
>
> I fetched the same segment repeatedly to find out what values work best.
>
> Hope that helps,
> Richard
>
> > Best regards
> >
> > --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >> To: [email protected]
> >> Date: Thursday, November 20, 2008, 2:40 PM
> >>
> >> Off the top of my head I would guess that you hit a patch of urls
> >> all from the same domain, and that would slow down fetching on a
> >> single host because only one thread would be active. The
> >> generate.max.per.host config variable can limit that.
> >>
> >> But that is just a guess.
> >> What job is it slowing down on? Yes, Nutch will take more time with
> >> more data, but that is too much of a difference.
> >>
> >> Dennis
> >>
> >> ML mail wrote:
> >>> Hello,
> >>>
> >>> I am currently using the recrawl script from the Nutch Wiki for
> >>> crawling all websites from a specific small top level domain and
> >>> have configured the recrawl script to run with THREADS=50, DEPTH=4,
> >>> TOPN=25000, which means that each time I run the script 100,000
> >>> pages will get crawled.
> >>>
> >>> The first time I ran the script it took 6 hours for the whole
> >>> process with mergesegs, invertlinks, index, merge and so on. The
> >>> second time it took just 3 hours more, so 9 hours, then the third
> >>> time 12 hours, but now the fourth time it is actually still running
> >>> after 22 hours and it's only at the 64,000th page to be crawled. It
> >>> looks like it is especially the fetch step and the index step which
> >>> are running much more slowly; the other steps look normal.
> >>>
> >>> So is this actually normal behavior for Nutch? I would expect Nutch
> >>> to be a tiny little bit slower each time due to updating an
> >>> always-growing database/index/segment, but never so much slower as
> >>> I am currently experiencing, especially when right now there are
> >>> only 144,915 pages indexed and the whole crawl directory with
> >>> everything is only around 2 GB big.
> >>>
> >>> Nutch is running on a quite good Pentium 4 Xeon computer, 2.8 GHz
> >>> with 1 GB RAM, with nothing else much running on it; also I didn't
> >>> change much in the config of Nutch itself, so it's pretty much
> >>> default.
> >>>
> >>> Does anyone have an idea? I can provide more info if you desire,
> >>> just let me know what you need.
> >>>
> >>> Many thanks in advance and best regards
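[Richard's rule of thumb from the thread above can be sketched as a small calculation; this is illustrative only, assuming integer division and a floor of 1 so the setting never drops to zero:]

```python
def suggested_max_per_host(top_n: int, num_nodes: int) -> int:
    """Rule of thumb from the thread:
    generate.max.per.host = topN / numberOfNodes / 1000,
    never going below 1."""
    return max(1, top_n // num_nodes // 1000)

# The poster's setup: topN=25000 on a single node
print(suggested_max_per_host(25000, 1))  # 25
```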
