On Feb 4, 2011, at 7:46 AM, Keith Wiley wrote:
> I have since discovered that in the case of streaming, mapred.map.tasks is a 
> good way to achieve this goal.  Ironically, if I recall correctly, this 
> seemingly obvious method for setting the number of mappers did not work so well 
> in my original nonstreaming case, which is why I resorted to the rather 
> contrived method of calculating and setting mapred.max.split.size instead.

        mapred.map.tasks basically only kicks in when the input size is less 
than a block: the framework treats it as a hint, using it to shrink the split 
size below the block size, so it can raise the number of maps for a small 
input but can never push it below one map per block.  (OK, it is technically 
more complex than that, but ... whatever.)  Given what you said in the other 
thread, what is going on makes a lot more sense now.
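
        For the archives, a minimal streaming sketch of passing that hint (the 
jar path, HDFS paths, and mapper script are placeholders, not your actual 
job).  The generic -D options have to come before the streaming-specific 
flags:

    # Ask the framework for ~100 maps; only effective while splits can shrink.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.map.tasks=100 \
        -input /user/you/filenames.txt \
        -output /user/you/out \
        -mapper ./process.sh \
        -file ./process.sh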

> Because not all of the slots are in use.  It's a very large cluster and it's 
> excruciating that Hadoop partially serializes a job by piling multiple map 
> tasks onto a single map slot in a queue even when the cluster is massively 
> underutilized.

        Well, sort of.

        The only input hadoop has to go on is your filename input, which is 
relatively tiny.  The splits are computed from the size of that little file, 
not from the amount of work each filename implies, so of course it is going to 
underutilize the cluster.  This makes sense now. :)
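
        For completeness, one workaround along these lines (not discussed in 
this thread, so treat it as a sketch, with the same placeholder paths as 
above): NLineInputFormat splits on line counts rather than byte sizes, which 
suits a file-of-filenames input.  Something like:

    # One map task per line of the filename list, regardless of file size.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.line.input.format.linespermap=1 \
        -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
        -input /user/you/filenames.txt \
        -output /user/you/out \
        -mapper ./process.sh \
        -file ./process.sh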


