On 6/1/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> The map output files are not located in DFS; they are on the local disks of the mapper that creates them, avoiding the 3X replication overhead of DFS. Wasn't there an issue about allowing replication to be defined on a per-file basis?
You *could* replicate once over using the current DFS. You probably wouldn't want to, though: since DFS currently chops files into blocks and distributes those blocks uniformly across all nodes, you'd be scattering the output of a map across the whole cluster. That means a *second* copy would have to be made (from wherever each block landed in DFS to the reducer node), doubling the number of times each block has to be transferred across the network.

And if a single block gets lost (remember, your 1x copy is spread across all nodes, including the possibly less-reliable ones, with no duplicates), you have to re-run the map.

Plus, right now there's nothing enforcing that tasktracker nodes will always be running a datanode...

--
Bryan A. Pendleton
Ph: (877) geek-1-bp
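The doubling argument above can be sketched as a toy transfer count. This is a hypothetical model, not Hadoop code: it assumes the single 1x replica of each map-output block lands on some node other than the mapper or the reducer, so the write and the reducer's fetch are each one network hop.

```python
# Hypothetical model of per-block network transfers for map output.
# Assumption: with 1x replication in DFS, each block is written to a
# (likely remote) datanode, then fetched from there by the reducer.

def transfers_via_dfs(num_blocks):
    write_hops = num_blocks  # mapper -> datanode holding the 1x replica
    read_hops = num_blocks   # that datanode -> reducer
    return write_hops + read_hops

def transfers_local_disk(num_blocks):
    # Current scheme: output stays on the mapper's local disk, so the
    # only network transfer is the reducer's fetch.
    return num_blocks

print(transfers_via_dfs(10))     # 20 transfers: every block crosses the wire twice
print(transfers_local_disk(10))  # 10 transfers: every block crosses the wire once
```

The model also ignores the failure case: losing any one of the scattered 1x replicas forces a full re-run of the map, which the local-disk scheme only suffers if the mapper's own node dies.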
