Sorry for repeating this question, but I have to find a solution; otherwise the crawling is too slow to be practical. I'm using nutch 0.9-dev on a single Linux server to crawl millions of pages. The fetching itself is reasonable, but the map-reduce operations are killing performance: fetching takes 10 hours, and map-reduce takes another 10 hours, which doubles the total crawl time. Can anyone share experience on how to speed up map-reduce for single-server crawling? The server uses the local file system, so map and reduce should take very little time, shouldn't they?
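
For context, I'm running with the stock configuration, so I assume everything goes through Hadoop's in-process LocalJobRunner and the tasks run serially. Below is the kind of override I've been considering in conf/hadoop-site.xml; the property names are the ones I see in the bundled hadoop-default.xml (fs.default.name, mapred.job.tracker, io.sort.mb, io.sort.factor), and the values are guesses on my part, so please correct me if I've misread how they apply to 0.9-dev:

  <?xml version="1.0"?>
  <configuration>
    <!-- Local file system and in-process job runner; these are the
         defaults I'm already using, shown here for clarity. -->
    <property>
      <name>fs.default.name</name>
      <value>local</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>local</value>
    </property>

    <!-- Knobs I suspect matter for the sort/merge cost between map
         and reduce; the values below are only my guesses. -->
    <property>
      <name>io.sort.mb</name>
      <value>200</value>   <!-- larger sort buffer before spilling to disk -->
    </property>
    <property>
      <name>io.sort.factor</name>
      <value>100</value>   <!-- merge more spill files per pass -->
    </property>
  </configuration>

Would tuning these buy anything on one box, or is it better to run a single-node "distributed" setup (a real jobtracker with several map/reduce tasks) to get some parallelism?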
Thanks,
--
AJ Chen, PhD
http://web2express.org
