Some tuning results: play with what you have and you might be surprised!

Simply running Java in server mode with the "-server" switch gave a ~13% improvement for a readdb, as noted below. The -server tweak did not help query results via Tomcat, but for basic Nutch DB work it did pretty well (this is a standalone box, and a resource-limited one at that). I've put the switch right in the nutch script in bin, so it's picked up just for Nutch.

I also played with the -Xm? settings (-Xms, -Xmx). If you have a memory-limited machine like I do, this helps reduce the swapping that was really slowing things down (my Nutch install has a 1000m heap size, way too big for my box). There are other Java options I've not tried yet (incremental garbage collection, etc.).

The experienced nutchers will have done this already, but it may help other newbies like me, and these Java tweaks are applicable to all Nutch revs. Also, use jconsole to probe the JVM resources being used in real time. My basic setup is now quite a bit faster than the default config:

-client (default for Java and Nutch)

    CrawlDb statistics start: crawl/crawldb
    Statistics for CrawlDb: crawl/crawldb
    TOTAL urls: 2631275
    retry 0: 2618055
    retry 1: 6847
    retry 2: 741
    retry 3: 5632
    min score: 0.0
    avg score: 5.279
    max score: 4063232.0
    status 1 (DB_unfetched): 2201893
    status 2 (DB_fetched): 390543
    status 3 (DB_gone): 38839
    CrawlDb statistics: done

    real 7m34.655s
    user 7m19.948s
    sys  0m10.032s

-server

    CrawlDb statistics start: crawl/crawldb
    Statistics for CrawlDb: crawl/crawldb
    TOTAL urls: 2631275
    retry 0: 2618055
    retry 1: 6847
    retry 2: 741
    retry 3: 5632
    min score: 0.0
    avg score: 5.279
    max score: 4063232.0
    status 1 (DB_unfetched): 2201893
    status 2 (DB_fetched): 390543
    status 3 (DB_gone): 38839
    CrawlDb statistics: done

    real 6m39.170s
    user 6m22.691s
    sys  0m10.191s

That's ~13% less user CPU time (about 12% less wall-clock time).

On another note, look at the switches you have available. For me, turning off filtering on the generate and turning off parsing during the fetch gave a nice boost. I run filtering on the crawldb from time to time, so there is no need to duplicate that effort in the generate step, where it really slows things down. I just run the parse after the fetch is done, and my combined times seem shorter than doing it in one step, as I'm also CPU AND bandwidth throttled.

As always, your mileage may vary, so give some things a try and you might get a nice surprise in improved speed.

-- rp

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
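
For anyone who wants to check percentages like the one quoted in this post, here is a minimal Python sketch that converts bash `time` values to seconds and works out the saving. The helper names (to_seconds, percent_saved) are made up for illustration; the timings are the ones from the two readdb runs in this post.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a bash `time` value such as '7m34.655s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def percent_saved(before: str, after: str) -> float:
    """Percentage of time saved going from `before` to `after`."""
    b, a = to_seconds(before), to_seconds(after)
    return (b - a) / b * 100


# Timings copied from the -client and -server readdb runs in this post.
client = {"real": "7m34.655s", "user": "7m19.948s"}
server = {"real": "6m39.170s", "user": "6m22.691s"}

for key in ("real", "user"):
    saved = percent_saved(client[key], server[key])
    print(f"{key}: {saved:.1f}% less time with -server")
# prints:
# real: 12.2% less time with -server
# user: 13.0% less time with -server
```

Comparing both "real" and "user" is deliberate: on a box that swaps or is otherwise throttled, wall-clock time and CPU time can move differently, so it is worth looking at both before deciding a tweak helped.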
