I am using the 0.9-dev code with the local file system to crawl on a single
machine. After fetching pages, Nutch spends a huge amount of time in the
"reduce > sort" and "reduce > reduce" phases. This seems unnecessary when
only the local file system is used. I'm not familiar with the map-reduce
code, but I guess it may be possible to control the number of map and reduce
operations. Is it possible to configure Nutch to break the fetch job into
only a few sub-operations, so that there are only one or a few map and
reduce operations? What settings or code can be changed to minimize the
time spent on map-reduce operations when crawling on a single machine?
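
To make the question concrete, here is a sketch of the kind of override I
have in mind, placed in conf/hadoop-site.xml. I am assuming these Hadoop
property names (mapred.map.tasks, mapred.reduce.tasks, io.sort.factor,
io.sort.mb) apply to this version; the values are guesses, and whether they
have any effect on the sort and reduce phases in local mode is exactly what
I don't know.

```xml
<?xml version="1.0"?>
<!-- Hypothetical overrides for a single-machine crawl; values are guesses. -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>   <!-- assumed: a hint for the number of map tasks per job -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>   <!-- assumed: number of reduce tasks per job -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>50</value>  <!-- assumed: number of streams merged at once while sorting -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value> <!-- assumed: sort buffer size in MB; must fit in the JVM heap -->
  </property>
</configuration>
```

If settings like these are the wrong place to look, a pointer to the right
ones would be appreciated.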

Thanks,
AJ
