On Feb 4, 2011, at 7:46 AM, Keith Wiley wrote:
> I have since discovered that in the case of streaming, mapred.map.tasks is a
> good way to achieve this goal. Ironically, if I recall correctly, this
> seemingly obvious method for setting the number of mappers did not work so well
> in my original nonstreaming case, which is why I resorted to the rather
> contrived method of calculating and setting mapred.max.split.size instead.
mapred.map.tasks basically only kicks in when the input size divided by the
requested number of tasks comes out to less than a block. (OK, it is
technically more complex than that, but ... whatever).
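For what it's worth, with streaming you can pass that as a generic option on
the command line. A rough sketch (the jar path, input/output paths, mapper,
and task count here are all placeholders, adjust for your setup):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.map.tasks=100 \
        -input /path/to/filelist.txt \
        -output /path/to/output \
        -mapper ./your_mapper.sh

The -D has to come before the streaming-specific options, and keep in mind
mapred.map.tasks is only a hint to the framework, not a hard guarantee.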
Given what you said in the other thread, what is going on here makes a lot
more sense now.
> Because all slots are not in use. It's a very large cluster and it's
> excruciating that Hadoop partially serializes a job by queueing multiple map
> tasks onto a single map slot even when the cluster is massively
> underutilized.
Well, sort of.
The only input Hadoop has to go on is your file of filenames, which is
relatively tiny. So of course it is going to underutilize the cluster. This
makes sense now. :)
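If you want to force one map task per filename regardless of how small the
input file is, one trick (just a sketch, not something from this thread) is
NLineInputFormat, which splits the input by lines instead of by size:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.line.input.format.linespermap=1 \
        -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
        -input /path/to/filelist.txt \
        -output /path/to/output \
        -mapper ./your_mapper.sh

That sidesteps the split-size math entirely: each line of the file becomes
its own split, so you get as many map tasks as you have filenames
(linespermap defaults to 1 anyway, it's just spelled out here).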