On 6/1/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> The map output files are not located in DFS; they are on the local disks of the mapper that creates them, avoiding the 3X replication overhead of DFS. Wasn't there an issue about allowing replication to be defined on a per-file basis?
You *could* replicate once over using the current DFS. You probably wouldn't want to, though: since DFS currently chops files into blocks and distributes those blocks uniformly across all nodes, you'd be scattering the output of a map across the whole cluster. That means a *second* copy would have to be made (from wherever each block landed in DFS to the reducer node), doubling the number of times each block has to be transferred across the network.

And if a single block gets lost (remember, your 1x copy is spread across all nodes, including the possibly less-reliable ones, with no duplicates), you have to re-run the map.

Plus, right now there's nothing enforcing that tasktracker nodes will always be running a datanode...

--
Bryan A. Pendleton
Ph: (877) geek-1-bp
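The doubling argument above can be sketched as a toy transfer count. This is a hypothetical model, not Hadoop code: it assumes the single 1x replica of each map-output block lands on some node other than the mapper or the reducer, so the write and the reducer's fetch are each one network hop.

```python
# Hypothetical model of per-block network transfers for map output.
# Assumption: with 1x replication in DFS, each block is written to a
# (likely remote) datanode, then fetched from there by the reducer.

def transfers_via_dfs(num_blocks):
    write_hops = num_blocks  # mapper -> datanode holding the 1x replica
    read_hops = num_blocks   # that datanode -> reducer
    return write_hops + read_hops

def transfers_local_disk(num_blocks):
    # Current scheme: output stays on the mapper's local disk, so the
    # only network transfer is the reducer's fetch.
    return num_blocks

print(transfers_via_dfs(10))     # 20 transfers: every block crosses the wire twice
print(transfers_local_disk(10))  # 10 transfers: every block crosses the wire once
```

The model also ignores the failure case: losing any one of the scattered 1x replicas forces a full re-run of the map, which the local-disk scheme only suffers if the mapper's own node dies.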
