I am using the 0.9-dev code and the local file system to crawl on a single machine. After fetching pages, Nutch spends a huge amount of time in the "reduce > sort" and "reduce > reduce" phases. This seems unnecessary, since the crawl uses only the local file system. I'm not familiar with the map-reduce code, but I guess it may be possible to control the number of map and reduce operations. Is it possible to configure Nutch to break the fetch job into only a few sub-operations, so that there are only one or a few map and reduce operations? What setting or code can be changed to minimize the time spent on map-reduce operations when crawling on a single machine?
Thanks, AJ
