Hello,

I am currently using the recrawl script from the Nutch Wiki to crawl all 
websites under a specific small top-level domain, and I have configured the 
script with THREADS=50, DEPTH=4, TOPN=25000, which means that each run of the 
script crawls up to 100'000 pages. 
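Just to show where that number comes from (a quick sanity check, not a quote from the script itself): each of the DEPTH generate/fetch rounds fetches at most TOPN pages, so:

```shell
# Sanity check: DEPTH rounds of at most TOPN pages each.
THREADS=50   # fetcher threads (does not affect the page count)
DEPTH=4
TOPN=25000
echo $((DEPTH * TOPN))   # prints 100000
```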

The first time I ran the script, the whole process (mergesegs, invertlinks, 
index, merge and so on) took 6 hours. The second run took 3 hours more, so 9 
hours, and the third run took 12 hours. But the fourth run is still going 
after 22 hours, and it is only at about the 64'000th page to be fetched. It 
is especially the fetch step and the index step that have become much slower; 
the other steps look normal. 
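For reference, the per-run cycle in the wiki recrawl script looks roughly like this (a sketch only; paths and segment names are placeholders, not my exact setup, and the `...` stand for the remaining arguments):

```shell
# Rough shape of one recrawl run (old Nutch 0.x style commands).
CRAWL=crawl
for i in $(seq 1 $DEPTH); do
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN $TOPN
  segment=$(ls -d $CRAWL/segments/* | tail -1)
  bin/nutch fetch $segment -threads $THREADS   # one of the slow steps
  bin/nutch updatedb $CRAWL/crawldb $segment
done
bin/nutch mergesegs ...
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch index ...                            # the other slow step
```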

So is this actually normal behavior for Nutch? I would expect each run to be 
a tiny bit slower than the last, because it updates an ever-growing 
database/index/segments, but never as much slower as I am currently 
experiencing. Especially since right now only 144'915 pages are indexed and 
the whole crawl directory, with everything in it, is only around 2 GB in size. 

Nutch is running on a reasonably good Pentium 4 Xeon machine at 2.8 GHz with 
1 GB RAM, with nothing else of note running on it. I also didn't change much 
in the Nutch configuration, so it is pretty much at the defaults.

Does anyone have an idea? I can provide more info if needed; just let me 
know what you require.

Many thanks in advance and best regards 

