Dear Richard and others interested,

I just wanted to post the results of reducing generate.max.per.host to 25 (instead of -1, i.e. unlimited), as recommended by Richard.
To summarize: the fetch step has dropped from 6 hours to 1 hour (with topN set to 25000), but unfortunately the generate step is still quite slow and takes around 4 hours (for the same topN). Is it normal for the generate step to still be this slow? The whole index is only around 170'000 pages. Is there perhaps an option in the nutch-default.xml config file for optimizing the generate process?

Best regards

--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:

> From: Richard Cyganiak <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls
> To: [email protected]
> Date: Friday, November 21, 2008, 2:42 AM
> On 21 Nov 2008, at 09:47, ML mail wrote:
> > What would then be the solution to this problem? Shall I simply
> > set generate.max.per.host to something like 5? Or is there
> > another way to make Nutch run at a good speed again?
>
> I spent some time trying different values for
> generate.max.per.host and I found that this is a good rule of
> thumb:
>
> generate.max.per.host = topN / numberOfNodes / 1000
>
> Where topN is the size of your segments and numberOfNodes is the
> number of machines in your cluster. This keeps the fetch rate close
> to maximum.
>
> Check the log of the fetch job -- if the last few pages consist of
> requests to just one or a few hosts, then your value for
> generate.max.per.host is too large. You want to fetch from many
> hosts in parallel throughout the entire fetch job. On the other
> hand, if you set it too low, then you will never make progress on
> these large sites.
>
> I fetched the same segment repeatedly to find out what values work
> best.
>
> Hope that helps,
> Richard
>
> > Best regards
> >
> > --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >> To: [email protected]
> >> Date: Thursday, November 20, 2008, 2:40 PM
> >> Off the top of my head I would guess that you hit a patch of
> >> URLs all from the same domain, and that would slow down fetching
> >> on a single host because only one thread would be active? The
> >> generate.max.per.host config variable can limit that.
> >>
> >> But that is just a guess. What job is it slowing down on? Yes,
> >> Nutch will take more time with more data, but that is too much
> >> of a difference.
> >>
> >> Dennis
> >>
> >> ML mail wrote:
> >>> Hello,
> >>>
> >>> I am currently using the recrawl script from the Nutch Wiki for
> >>> crawling all websites from a specific small top level domain
> >>> and have configured the recrawl script to run with THREADS=50,
> >>> DEPTH=4, TOPN=25000. This means that each time I run the script
> >>> 100'000 pages will get crawled.
> >>> The first time I ran the script it took 6 hours for the whole
> >>> process with mergesegs, invertlinks, index, merge and so on.
> >>> The second time it took just 3 hours more, so 9 hours, then the
> >>> third time 12 hours, but now the fourth time it is still
> >>> running after 22 hours and it's only at the 64'000th page to be
> >>> crawled. It looks like it is especially the fetch step and the
> >>> index step which are running much more slowly; the other steps
> >>> look normal.
> >>> So is this actually normal behavior for Nutch? I would expect
> >>> Nutch to be a tiny bit slower each time due to updating an
> >>> always growing database/index/segment, but never as much slower
> >>> as I am currently experiencing. Especially when right now there
> >>> are only 144'915 pages indexed and the whole crawl directory
> >>> with everything is only around 2 GB.
> >>> Nutch is running on a quite good Pentium 4 Xeon computer,
> >>> 2.8 GHz with 1 GB RAM and nothing else much running on it. Also
> >>> I didn't change much in the config of Nutch itself, so it's
> >>> pretty much default.
> >>>
> >>> Does anyone have an idea? I can provide more info if you
> >>> desire, just let me know what you need.
> >>>
> >>> Many thanks in advance and best regards
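
For anyone who wants to try Richard's rule of thumb and his log check from the thread above, here is a minimal Python sketch. It is not part of Nutch. The function names max_per_host and tail_host_counts are hypothetical, the floor of 1 is an assumption added so the result never drops to 0, and the second function assumes you have already extracted the fetched URLs from the fetch log in order, since the log format is not shown in this thread.

```python
from collections import Counter
from urllib.parse import urlparse


def max_per_host(top_n: int, number_of_nodes: int) -> int:
    """Rule of thumb quoted in the thread: topN / numberOfNodes / 1000."""
    if top_n <= 0 or number_of_nodes <= 0:
        raise ValueError("top_n and number_of_nodes must be positive")
    # Assumption (not from the thread): never return less than 1,
    # so that small segments still get a usable per-host limit.
    return max(1, top_n // (number_of_nodes * 1000))


def tail_host_counts(fetched_urls, tail_size=200):
    """Count the hosts among the last `tail_size` fetched URLs.

    If only one or a few hosts show up here, the end of the fetch job
    was dominated by them, which per the thread suggests that
    generate.max.per.host is set too large.
    """
    if tail_size <= 0:
        raise ValueError("tail_size must be positive")
    tail = fetched_urls[-tail_size:]
    return Counter(urlparse(url).hostname for url in tail)


if __name__ == "__main__":
    # Single machine with topN=25000, as in the setup described above.
    print(max_per_host(25000, 1))  # 25
```

With topN=25000 on one machine this gives 25, which matches the value tried in this post. The tail size of 200 is an arbitrary starting point, so adjust it to the size of your segments.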
