Hi, I'm working with Map/Reduce in Nutch 0.8 and would like to
distribute segments across multiple machines via NDFS.  Let's say I've
got ~250GB of hard-drive space per machine; to store terabytes of data,
should I generate a bunch of ~200GB segments and push them out into NDFS?
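For concreteness, here is roughly what I mean by "push them out" (a
rough sketch only; NutchFileSystem.get(conf) and copyFromLocalFile are
my guesses at the mapred-branch API, and the paths/segment name are
made up, so please correct me if I've misremembered the signatures):

  import java.io.File;
  import java.io.IOException;

  import org.apache.nutch.fs.NutchFileSystem;
  import org.apache.nutch.util.NutchConf;

  public class PushSegment {
    // Copy a locally generated segment directory up into NDFS so the
    // map/reduce nodes can read it.  The method names here are my
    // assumptions about the NutchFileSystem API, not verified code.
    public static void main(String[] args) throws IOException {
      NutchConf conf = new NutchConf();
      NutchFileSystem ndfs = NutchFileSystem.get(conf);
      ndfs.copyFromLocalFile(new File("/local/segments/20060101120000"),
                             new File("/user/ndfs/segments/20060101120000"));
    }
  }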


How should I partition/organize these segments?  Randomly?  By URL or
by host?  The use case I care about is random access to a given URL or
host; or is that the sort of thing map/reduce itself handles?
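By "by host" I mean hashing each URL's host so that all pages from one
host land in the same partition/segment.  A minimal sketch of the idea
in plain Java (not actual Nutch code; I gather the mapred branch has
something along the lines of PartitionUrlByHost for this):

  import java.net.MalformedURLException;
  import java.net.URL;

  public class HostPartitioner {
    // Map a URL to one of numPartitions buckets by hashing its host,
    // so all pages from a given host end up in the same partition.
    public static int getPartition(String url, int numPartitions) {
      try {
        String host = new URL(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
      } catch (MalformedURLException e) {
        return 0;  // unparseable URLs fall back to bucket 0
      }
    }

    public static void main(String[] args) {
      // Both pages hash to the same bucket because they share a host.
      System.out.println(getPartition("http://example.com/a.html", 16));
      System.out.println(getPartition("http://example.com/b.html", 16));
    }
  }

That would make lookups by host cheap (you know which partition to
search), though random access to an arbitrary URL still seems to need
an index of some kind.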


Thanks for any insight or ideas!


DaveG
