Wondering if anyone would be willing to share the optimizations/configurations they've made to their whole-web crawling setup. I'm running a dual-CPU system with 4GB of RAM and performance has been lacking. The target is a large academic domain with several hundred sub-domains, which I'm treating as a whole-web crawl.

Questions:

1) What JVM are you using for SMP (on Fedora Core 4)? Is there a JVM/OS combination whose thread management takes full advantage of both CPUs? Sun's JVM appears to be locking Nutch onto one CPU.
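For reference, here's what I've tried so far on the JVM side. These are stock Sun JVM flags (present in JDK 1.4.2/5.0), but whether bin/nutch picks up a NUTCH_OPTS variable is an assumption on my part -- if yours doesn't, the java invocation in the script can be edited directly:

    # -server for the server compiler; parallel young-gen
    # collection spread across both CPUs
    export NUTCH_OPTS="-server -XX:+UseParallelGC -XX:ParallelGCThreads=2"

Watching top (press '1' for the per-CPU view) during a fetch is how I've been checking whether both CPUs actually get used.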

2) What have you done for memory management? 4GB of RAM lets the JVM grab a large slice, but with topN segments of 10K-50K URLs the box grinds to a halt.
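What I have now is a fixed heap set via NUTCH_HEAPSIZE (the bin/nutch copies I've seen read it as a megabyte value -- check yours), sized to leave RAM free for the OS to cache segment data:

    # ~2GB heap, leaving ~2GB for the OS page cache; on a
    # 32-bit JVM you can't go much past 2GB anyway
    export NUTCH_HEAPSIZE=2000

The grinding suggests I'm either swapping or GC-thrashing, and I can't tell which.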

3) How are you scripting the fetch, dedup, analyze, refetch, etc. processes? The scripts from the wiki are a useful starting point, but I'm wondering whether anyone has a more advanced/optimized setup (my current loop is sketched after 3a below).

3a) Specifically, how are you handling/scripting the creation, fetching, and merging of segments? What segment sizes? Are you using -topN or some other method?
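For context, here's a stripped-down version of the cycle I'm running now, adapted from the wiki scripts. Command names follow the 0.7-era whole-web tutorial and may differ in your version; the topN value is a placeholder I'd love a better number for:

    #!/bin/sh
    # one generate/fetch/update cycle for a whole-web style crawl
    TOPN=25000                            # segment size -- open question

    bin/nutch generate db segments -topN $TOPN
    seg=`ls -d segments/2* | tail -1`     # newest segment directory

    bin/nutch fetch $seg
    bin/nutch updatedb db $seg
    bin/nutch analyze db 5                # 5 link-analysis iterations
    bin/nutch index $seg
    bin/nutch dedup segments dedup.tmp

    # merging: not scripted yet -- SegmentMergeTool ("mergesegs")
    # is what I'd try next, if your version ships it

Refetching for me is just re-running this loop on a cron schedule, which is part of why I'm asking.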
