I am using the 0.9-dev code with the local file system to crawl on a single
machine. After fetching pages, Nutch spends a huge amount of time in the
"reduce > sort" and "reduce > reduce" phases. This should not be necessary,
since the crawl uses only the local file system. I'm not familiar with the
map-reduce code, but I guess it may be possible to control the number of map
and reduce operations. Is it possible to configure Nutch to break the fetch
job into only a few sub-operations, so that there are only one or a few map
and reduce operations? What settings or code can be changed to minimize the
time spent on map-reduce operations when crawling on a single machine?

Thanks,
AJ
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
