Off the top of my head, my guess is that you hit a patch of URLs all from the same host, which would slow fetching down because only one thread stays active against a single host. The generate.max.per.host config property can limit how many URLs from one host end up in a fetchlist.
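As a minimal sketch, an override like the following could go inside the <configuration> element of conf/nutch-site.xml; the value 100 is purely illustrative, not a recommendation:

```xml
<!-- Hypothetical override in conf/nutch-site.xml.
     Caps how many URLs from any single host are selected
     into one fetchlist; -1 (the default) means no limit. -->
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
```

With a cap in place, a generate run spreads the fetchlist across more hosts, so a burst of URLs from one slow domain can't tie up the whole fetch step.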

But that is just a guess. Which job is it slowing down on? Yes, Nutch will take more time as the data grows, but not that much more.

Dennis

ML mail wrote:
Hello,

I am currently using the recrawl script from the Nutch Wiki to crawl all websites from a specific small top-level domain, and have configured the script to run with THREADS=50, DEPTH=4, TOPN=25000, which means each run crawls 100'000 pages. The first time I ran the script, the whole process with mergesegs, invertlinks, index, merge and so on took 6 hours. The second time it took 3 hours more, so 9 hours; the third time 12 hours; but now the fourth time it is still running after 22 hours and has only reached the 64'000th page. It is especially the fetch step and the index step that run much more slowly; the other steps look normal. Is this normal behavior for Nutch? I would expect each run to be a tiny bit slower due to updating an ever-growing database/index/segments, but never as much slower as I am currently experiencing, especially since there are only 144'915 pages indexed so far and the whole crawl directory with everything in it is only around 2 GB.
Nutch is running on a fairly good Pentium 4 Xeon machine at 2.8 GHz with 1 GB RAM, with little else running on it. I also didn't change much in the Nutch config itself, so it's pretty much at the defaults.

Does anyone have an idea? I can provide more info if needed; just let me know what you need.

Many thanks in advance and best regards
