Hello,
I am currently using the recrawl script from the Nutch Wiki to crawl all
websites from a specific small top-level domain, and I have configured the
script to run with THREADS=50, DEPTH=4, TOPN=25000, which means that each run
of the script crawls up to 100'000 pages.
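For context, the 100'000 figure just follows from the generate/fetch loop; a minimal sketch of the arithmetic (variable names match the recrawl script's parameters):

```shell
#!/bin/sh
# Parameters as configured in the recrawl script
THREADS=50   # fetcher threads
DEPTH=4      # generate/fetch/update cycles per run
TOPN=25000   # maximum pages generated per cycle

# Upper bound on pages fetched in one run: one TOPN batch per depth cycle
echo $((DEPTH * TOPN))   # prints 100000
```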
The first time I ran the script, the whole process (mergesegs, invertlinks,
index, merge, and so on) took 6 hours. The second time it took 3 hours more,
so 9 hours; the third time, 12 hours. But now the fourth run is still going
after 22 hours, and it is only at around the 64'000th page to be crawled. It
looks like the fetch step and the index step in particular are running much
more slowly; the other steps look normal.
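For reference, the per-run sequence I mean is roughly the following (a sketch
from memory of the Wiki script, not verbatim; exact paths and flags in the
actual script may differ). Wrapping the suspect steps in `time` is how I can
tell fetch and index are the slow ones:

```shell
#!/bin/sh
# Rough sketch of one recrawl cycle (Nutch 0.8-style commands; illustrative only)
for i in `seq 1 $DEPTH`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN $TOPN
  segment=`ls -d crawl/segments/* | tail -1`
  time bin/nutch fetch $segment -threads $THREADS   # gets slower every run
  bin/nutch updatedb crawl/crawldb $segment
done
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
bin/nutch invertlinks crawl/linkdb crawl/MERGEDsegments/*
time bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/MERGEDsegments/*                          # also slow
bin/nutch dedup crawl/NEWindexes
bin/nutch merge crawl/index crawl/NEWindexes
```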
So is this actually normal behavior for Nutch? I would expect each run to be
a tiny bit slower than the last, since it has to update an ever-growing
database/index/segment set, but never this much slower, especially when there
are currently only 144'915 pages indexed and the whole crawl directory with
everything in it is only around 2 GB.
Nutch is running on a fairly good Pentium 4 Xeon machine at 2.8 GHz with 1 GB
RAM, with not much else running on it. I also didn't change much in the Nutch
configuration, so it's pretty much the default.
Does anyone have an idea? I can provide more info if needed; just let me know
what you need.
Many thanks in advance and best regards