Hello,

I understand that mapred.map.tasks is just a recommendation for the
framework. I also know that, by default, one map task takes as input
one block of data, so the lower limit on the number of maps is equal
to the number of blocks of input data. (There is also the
one-map-per-file constraint.) But, using the default InputFormat, I
can bypass the one-block-per-map limit by setting the
mapred.min.split.size variable. So, basically, one can control the
number of maps very precisely.
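
Just to make it concrete, this is roughly what I mean (a rough sketch
using the old mapred API; the input path, sizes, and class name are
only placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitSizeSketch.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/rares/input"));

        // Only a hint; the framework is free to choose differently.
        conf.setNumMapTasks(20);

        // Raising the minimum split size above the block size (here 256 MB
        // with 64 MB blocks) makes each map read several blocks, so the
        // total number of maps drops below the number of blocks.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

        // ... mapper, reducer, output path, then JobClient.runJob(conf)
      }
    }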

From a performance standpoint (without considering failures), what
would be a good number of maps per node/core? If I want to finish the
task as soon as possible and I have a fixed number of nodes/cores, how
many maps should I assign per node/core? If, for example, I need the
same amount of time to process any of the records, then one map per
core might be a good choice as I would minimize the task startup time.
If different records have different processing times, then 10 maps per
core might be a better choice: with, say, 2 cores per node, a core
that finishes its own maps early can pick up maps that would otherwise
have gone to the slower core, so overall I would expect at most about
10% idle time per core. If I increase the number of maps much further,
task start-up time might start to take a large fraction of the total
time, so I don't want to go there either.
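
To put numbers on it, here is the back-of-the-envelope calculation I
have in mind (all of these figures are invented for the example):

    public class MapCountSketch {
      public static void main(String[] args) {
        long totalInputBytes = 1024L * 1024 * 1024 * 1024; // say, 1 TB of input
        int nodes = 20;
        int coresPerNode = 2;
        int mapsPerCore = 10;                              // the 10-maps-per-core idea

        int targetMaps = nodes * coresPerNode * mapsPerCore;  // 400 map tasks
        long splitBytes = totalInputBytes / targetMaps;       // ~2.7 GB per split

        // If splitBytes ends up larger than the block size, I would set
        // mapred.min.split.size to splitBytes; if it is smaller, setting
        // mapred.map.tasks to targetMaps should be enough as a hint.
        System.out.println("maps = " + targetMaps + ", split = " + splitBytes + " bytes");
      }
    }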

Are there any Hadoop internals that would help me make a better decision on
the number of maps? For example, how much time does it take to start a
task?

Thanks!
--
Rares
