Nutch tuning - speed improvements that worked for me

RP Wed, 20 Dec 2006 20:24:35 -0800

Some tuning results: play with what you have and you might be surprised! A simple tweak, running Java in server mode with the "-server" switch, gave a ~13% improvement for a readdb, as noted below. The -server tweak did not help on query results via Tomcat, but for basic Nutch DB work it did pretty well (this is a standalone box, and a resource-limited one as well). As such, I've put this tweak right in the nutch file in bin, so it's picked up just for Nutch.

I also played with the -Xm? type settings (-Xms, -Xmx). If you have a memory-limited machine like I do, this helps reduce the swapping that was really slowing things down (my Nutch install has a 1000m heap size, way too big for my box). There are other Java things I've not tried yet (incremental garbage collection, etc.).

The experienced Nutchers will have done this already, but for other newbies like me it may help, and these Java tweaks are applicable to all Nutch revs. Also, use jconsole to probe the JVM resources being used in real time. My basic setup is quite a bit faster now than in the default config:


-client (default for java and Nutch)
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2631275
retry 0:        2618055
retry 1:        6847
retry 2:        741
retry 3:        5632
min score:      0.0
avg score:      5.279
max score:      4063232.0
status 1 (DB_unfetched):        2201893
status 2 (DB_fetched):  390543
status 3 (DB_gone):     38839
CrawlDb statistics: done


real    7m34.655s
user    7m19.948s
sys     0m10.032s

-server
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     2631275
retry 0:        2618055
retry 1:        6847
retry 2:        741
retry 3:        5632
min score:      0.0
avg score:      5.279
max score:      4063232.0
status 1 (DB_unfetched):        2201893
status 2 (DB_fetched):  390543
status 3 (DB_gone):     38839
CrawlDb statistics: done

real    6m39.170s
user    6m22.691s
sys     0m10.191s

That's ~13% better on user CPU time (about 12% on wall-clock time).
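For anyone who wants to check the figure against their own runs, here is a small sketch that turns the `time` output above into a percentage. The helper names `to_seconds` and `pct_faster` are made up for this example; they are not part of Nutch.

```shell
# Convert a `time`-style duration such as 7m34.655s into seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Percentage reduction going from a baseline time to a new time (both in seconds).
pct_faster() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# "real" times from the two runs above: prints 12.2
pct_faster "$(to_seconds 7m34.655s)" "$(to_seconds 6m39.170s)"

# "user" times from the two runs above: prints 13.0
pct_faster "$(to_seconds 7m19.948s)" "$(to_seconds 6m22.691s)"
```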

On another note, look at the switches you have available. For me, turning off filtering in the generate step and turning off parsing during the fetch gave a nice boost. I run filtering on the crawldb from time to time, so there's no need to duplicate that effort in the generate step, where it really slows things down. I just run the parse after the fetch is done, and my combined times seem shorter than doing it in one step, as I'm both CPU and bandwidth throttled. As always, your mileage may vary, so give some things a try and you might get a nice surprise in improved speed.
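Pulling the tweaks in this message together, one round might look roughly like the sketch below. Treat it as an assumption-laden outline rather than the exact setup described here: it passes the JVM switch through the NUTCH_OPTS and NUTCH_HEAPSIZE environment variables that the bin/nutch script reads (instead of editing the script), the 512 heap value and the crawl_round function name are made up for illustration, and the -noFilter and -noParsing switch names should be checked against the usage text your Nutch revision prints for generate and fetch.

```shell
# JVM settings picked up by bin/nutch (assumed: your bin/nutch honors these variables).
export NUTCH_OPTS="-server"     # run the JVM in server mode
export NUTCH_HEAPSIZE=512       # max heap in MB; illustrative value, size it for your box

NUTCH="${NUTCH:-bin/nutch}"     # path to the nutch launcher (assumed layout)

# One generate/fetch/parse round with generate-time filtering and
# in-fetch parsing switched off. Switch names are assumptions; verify them.
crawl_round() {
  crawldb="$1"
  segments="$2"

  # Build a fetch list without re-running the URL filters.
  "$NUTCH" generate "$crawldb" "$segments" -noFilter || return 1

  # Pick the newest segment directory (names sort by timestamp).
  segment=$(ls -d "$segments"/* | tail -1)

  # Fetch only, then parse as a separate step once the fetch is done.
  "$NUTCH" fetch "$segment" -noParsing || return 1
  "$NUTCH" parse "$segment"
}

# Usage (not run here):
#   crawl_round crawl/crawldb crawl/segments
```

Splitting the fetch and the parse this way keeps the bandwidth-bound work and the CPU-bound work from competing on a small machine, which matches the experience reported above.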


--
rp
