Ben Halsted wrote:
> I'm trying to configure a single box running fetch/index/merge in a loop using the mapred branch (with ndfs).
Why are you using ndfs on a single box? It would be faster and simpler to use the local filesystem.
> Could the slowdown be the index & merge processes running at the same time, or do I not have enough spiders running?
On a single box you might instead just run a single fetcher and alter the number of threads.
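For example, the thread count can be set with a property override. This is only a sketch: it assumes the override lives in your site configuration file (typically conf/nutch-site.xml) inside that file's root element, and the value 20 is purely illustrative, not a recommendation.

```xml
<!-- Sketch only: goes inside the root element of the site override file. -->
<!-- The value 20 is an illustrative assumption; tune it for your box. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value>
</property>
```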
I suspect the slowdown is because your crawls are dominated by a few hosts, and politeness forces the fetcher to access each of them slowly. Are you crawling hosts you control? If so, you might consider setting fetcher.threads.per.host to something greater than one.
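Something like the following, as a sketch: it assumes the override goes in your site configuration file (typically conf/nutch-site.xml) inside that file's root element, and the value 4 is just an illustration. Only do this for hosts you control, since it relaxes politeness.

```xml
<!-- Sketch only: goes inside the root element of the site override file. -->
<!-- The value 4 is an illustrative assumption; raising it above 1 allows -->
<!-- several threads to fetch from the same host at once. -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>4</value>
</property>
```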
Doug
