Hi, I'm working with Map/Reduce in Nutch 0.8, and I'd like to distribute segments across multiple machines via NDFS. Say I've got ~250GB of hard-drive space per machine; to store terabytes of data, should I generate a bunch of ~200GB segments and push them out into NDFS? (A sketch of what I mean is below.)
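For concreteness, this is roughly how I picture pushing a locally generated segment into NDFS, assuming the FileSystem API in the Hadoop jar bundled with Nutch 0.8 (the class name PushSegment and both paths are made up, and the exact method names may differ in other builds):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: copy a locally generated segment directory into NDFS.
// Assumes fs.default.name in the config points at the namenode.
public class PushSegment {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);  // NDFS, if so configured
    Path local = new Path("/local/segments/20060101123456");   // hypothetical
    Path remote = new Path("/user/nutch/segments/20060101123456");
    fs.copyFromLocalFile(local, remote);   // replicated across datanodes
  }
}

(Or presumably the shell equivalent, something like a dfs -put.)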
How should I partition/organize these segments: randomly, or by URL or host? The use case that matters is random access to a given URL or host, or is that kind of lookup better accomplished via map/reduce? Thanks for any insight or ideas! DaveG
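
P.S. To make the "by host" option concrete, here's the kind of deterministic mapping I have in mind; just a sketch (the class name, numNodes, and sample URL are made up), though I believe Nutch's own PartitionUrlByHost does something similar when generating fetch lists:

import java.net.URL;

// Sketch: deterministically map a URL to one of N machines by hashing
// its host, so all pages from a given host land on the same node.
public class HostPartitioner {
  public static int partition(String url, int numNodes) throws Exception {
    String host = new URL(url).getHost();
    // Mask the sign bit so the bucket index is always non-negative.
    return (host.hashCode() & Integer.MAX_VALUE) % numNodes;
  }

  public static void main(String[] args) throws Exception {
    // All pages from lucene.apache.org map to the same one of 8 nodes.
    System.out.println(partition("http://lucene.apache.org/nutch/", 8));
  }
}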
