I can see why it is straightforward to have one mapper process per block: it is 
a simple "cat block | mapper".
On the other hand, when a mapper's start-up time is not trivial (say I need to 
load a fairly large dictionary), that scheme is less than ideal, because the 
start-up cost is paid once for every block that happens to be on that node.
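
To make the cost concrete, here is a minimal sketch of the kind of streaming mapper I have in mind, written in Python. The function names (load_dictionary, map_lines), the tab-separated dictionary format, and the dictionary path passed on the command line are all illustrative assumptions, not anything defined by Hadoop.

```python
import sys


def load_dictionary(lines):
    """Build a lookup table from 'key<TAB>value' lines (assumed format)."""
    table = {}
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        if sep:  # skip lines that have no tab separator
            table[key] = value
    return table


def map_lines(lines, table):
    """Yield 'word<TAB>value' for every input word found in the table."""
    for line in lines:
        for word in line.split():
            if word in table:
                yield f"{word}\t{table[word]}"


def main(dict_path):
    # This load is the expensive start-up step. It runs once per mapper
    # process, which in the per-block scheme means once per block.
    with open(dict_path, encoding="utf-8") as f:
        table = load_dictionary(f)
    for record in map_lines(sys.stdin, table):
        print(record)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Run as "cat block | python mapper.py dict.tsv" (both file names are hypothetical), the dictionary load in main() repeats for every block on the node, even though the table it builds is identical each time.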

What would it take to pipe ALL the blocks that are part of the input set on a 
given node to ONE mapper process?

Cheers,

  Erez Katz


      
