Wondering if anyone would be willing to share the optimizations/configurations they've made to their whole-web crawling setup. I'm running a dual-CPU system with 4GB of RAM and performance has been lacking. The target is a large academic domain with several hundred sub-domains, which I'm treating as a whole-web crawl.

Questions:

1) What JVM are you using for SMP (on Fedora Core 4)? Is there a JVM/OS combination whose thread management takes full advantage of both CPUs? Sun's JVM appears to be locking Nutch onto one CPU.
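For reference, here's what I've tried so far on the JVM side. These are stock Sun JVM flags (present in JDK 1.4.2/5.0), but whether bin/nutch picks up a NUTCH_OPTS variable is an assumption on my part -- if yours doesn't, the java invocation in the script can be edited directly:

    # -server for the server compiler; parallel young-gen
    # collection spread across both CPUs
    export NUTCH_OPTS="-server -XX:+UseParallelGC -XX:ParallelGCThreads=2"

Watching top (press '1' for the per-CPU view) during a fetch is how I've been checking whether both CPUs actually get used.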

2) What have you done for memory management? 4GB of RAM lets the JVM grab a large slice, but with topN segments of 10K-50K URLs the box grinds to a halt.
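What I have now is a fixed heap set via NUTCH_HEAPSIZE (the bin/nutch copies I've seen read it as a megabyte value -- check yours), sized to leave RAM free for the OS to cache segment data:

    # ~2GB heap, leaving ~2GB for the OS page cache; on a
    # 32-bit JVM you can't go much past 2GB anyway
    export NUTCH_HEAPSIZE=2000

The grinding suggests I'm either swapping or GC-thrashing, and I can't tell which.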

3) How are you scripting the fetch, dedup, analyze, refetch, etc. processes? The scripts from the wiki are a useful starting point, but I'm wondering whether anyone has a more advanced/optimized setup (my current loop is sketched after 3a below).

3a) Specifically, how are you handling/scripting the creation, fetching, and merging of segments? What segment sizes? Are you using -topN or some other method?
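For context, here's a stripped-down version of the cycle I'm running now, adapted from the wiki scripts. Command names follow the 0.7-era whole-web tutorial and may differ in your version; the topN value is a placeholder I'd love a better number for:

    #!/bin/sh
    # one generate/fetch/update cycle for a whole-web style crawl
    TOPN=25000                            # segment size -- open question

    bin/nutch generate db segments -topN $TOPN
    seg=`ls -d segments/2* | tail -1`     # newest segment directory

    bin/nutch fetch $seg
    bin/nutch updatedb db $seg
    bin/nutch analyze db 5                # 5 link-analysis iterations
    bin/nutch index $seg
    bin/nutch dedup segments dedup.tmp

    # merging: not scripted yet -- SegmentMergeTool ("mergesegs")
    # is what I'd try next, if your version ships it

Refetching for me is just re-running this loop on a cron schedule, which is part of why I'm asking.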
