Some tuning results: play with what you have and you might be surprised. Simply running Java in server mode with the "-server" switch gave a ~13% improvement for a readdb, as noted below. The -server switch did not help query results via Tomcat, but for basic Nutch DB work it did pretty well (this is a standalone box, and a resource-limited one at that). So I've put the switch right in the nutch script in bin, where it's picked up just for Nutch.

I also played with the -Xm? heap settings (-Xms, -Xmx). If you have a memory-limited machine like I do, a smaller heap helps reduce the swapping that was really slowing things down (my Nutch install has a 1000m heap size, way too big for my box). There are other Java options I've not tried yet (incremental garbage collection, etc.).

The experienced nutchers will have done all this already, but for other newbies like me it may help, and these Java tweaks apply to all Nutch revs. Also, use jconsole to watch the JVM resources being used in real time. My basic setup is quite a bit faster now than in the default config:

-client (default for Java and Nutch)
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2631275
retry 0:        2618055
retry 1:        6847
retry 2:        741
retry 3:        5632
min score:      0.0
avg score:      5.279
max score:      4063232.0
status 1 (DB_unfetched):        2201893
status 2 (DB_fetched):  390543
status 3 (DB_gone):     38839
CrawlDb statistics: done

real    7m34.655s
user    7m19.948s
sys     0m10.032s

-server
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2631275
retry 0:        2618055
retry 1:        6847
retry 2:        741
retry 3:        5632
min score:      0.0
avg score:      5.279
max score:      4063232.0
status 1 (DB_unfetched):        2201893
status 2 (DB_fetched):  390543
status 3 (DB_gone):     38839
CrawlDb statistics: done

real    6m39.170s
user    6m22.691s
sys     0m10.191s

That's ~13% better in user CPU time (about 12% in wall-clock time).
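
For anyone who wants to repeat the comparison, here is a minimal shell sketch. It assumes your bin/nutch script reads the NUTCH_OPTS and NUTCH_HEAPSIZE environment variables (check the comments at the top of the script; if your rev does not, edit the java command line in the script directly instead). The heap value of 512 and the helper names to_seconds and percent_saved are illustrative only and do not come from the setup described above.

```shell
# Assumption: bin/nutch passes NUTCH_OPTS to the JVM and turns
# NUTCH_HEAPSIZE (in MB) into -Xmx. Verify this in your own script.
export NUTCH_OPTS="-server"
export NUTCH_HEAPSIZE=512   # example value only; size it to your RAM

# Then time the same job once per setting, for example:
#   time bin/nutch readdb crawl/crawldb -stats

# Convert a `time`-style duration such as 7m34.655s into seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Percent of time saved going from $1 (before) to $2 (after).
percent_saved() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

percent_saved 7m34.655s 6m39.170s   # real: prints 12.2
percent_saved 7m19.948s 6m22.691s   # user: prints 13.0
```

Run each setting more than once if you can, since disk caching alone can move the numbers by a few percent.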

On another note, look at the switches you have available. For me, turning off filtering in the generate step and turning off parsing during the fetch gave a nice boost. I run filtering on the crawldb from time to time, so there's no need to duplicate that effort in the generate step, where it really slows things down. I just run the parse after the fetch is done, and my combined times seem shorter than doing both in one step, as I'm also CPU and bandwidth throttled. As always, your mileage may vary, so give some things a try and you might get a nice surprise in improved speed.
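
The generate and fetch switches above can be sketched roughly as follows. This assumes a Nutch rev whose generate command accepts -noFilter and whose fetch command accepts -noParsing; run bin/nutch generate and bin/nutch fetch with no arguments to see the usage for your version. The run, generate_unfiltered and fetch_then_parse helpers are hypothetical wrappers, not part of Nutch.

```shell
# Path to the nutch launcher; an assumed default, override as needed.
NUTCH=${NUTCH:-bin/nutch}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Generate a fetch list without URL filtering.
# $1 = crawldb, $2 = segments directory
generate_unfiltered() {
  run "$NUTCH" generate "$1" "$2" -noFilter
}

# Fetch a segment without parsing, then parse it as a separate step.
# $1 = segment directory
fetch_then_parse() {
  run "$NUTCH" fetch "$1" -noParsing && run "$NUTCH" parse "$1"
}
```

A typical round would call generate_unfiltered crawl/crawldb crawl/segments, pick the newest directory under crawl/segments, and pass it to fetch_then_parse. Setting DRY_RUN=1 first shows the commands without running them.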

--
rp


