I've been exploring the same question lately: capping max simultaneous
tasks per node.

Adjusting the file split size would work, though it is an indirect way of
doing it.

In many cases it would be cleaner and much easier to have an explicit max
task cap setting; ideally that would be configurable per pool in the Fair
Scheduler.

But as far as I know, Hadoop currently has no simple way to cap the number
of tasks per machine for a specific job (or pool of jobs). The only knob is
the per-tasktracker slot setting, which applies globally to every job on the
node. If one or a few specific jobs need a different max, you're stuck.
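
(Concretely, the global setting I mean is the per-tasktracker slot count in
mapred-site.xml, something like:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>

Those cap slots for the whole node and for every job at once, and changing
them means reconfiguring and restarting the tasktrackers, not the job.)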

So the file split size approach, while indirect and more complex than a
simple config setting would be, is the only workaround I know of.
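
For what it's worth, here is a rough sketch of that workaround using the old
mapred API. The class name, paths, and the 1 GB figure are just placeholders;
the only real point is the mapred.min.split.size knob, which FileInputFormat
consults when it computes splits:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class BigSplitJob {
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(BigSplitJob.class);
      // Ask FileInputFormat for splits of at least 1 GB, so the job ends up
      // with far fewer map tasks than it has HDFS blocks.
      conf.setLong("mapred.min.split.size", 1024L * 1024 * 1024);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      // Mapper, reducer and format settings omitted; this only shows the
      // split-size knob.
      JobClient.runJob(conf);
    }
  }

If the job goes through ToolRunner, the same thing can be done without
touching code by passing -D mapred.min.split.size=1073741824 on the command
line.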

The question actually has some subtlety, because there is both the total #
of tasks for the job and the # that will run simultaneously. In some cases
it's fine to have a lot of tasks, as long as only 1 (or some other max cap)
runs at a time on each machine. In other cases you need to limit the total #
of tasks, regardless of how many run simultaneously. The file split approach
controls the total # of tasks for the job, which may affect the # that run
simultaneously, either directly or indirectly.
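
To make that concrete with made-up numbers: a 100 GB input at the default
64 MB block size produces roughly 1,600 map tasks. On a 20-node cluster with
4 map slots per node, at most 80 of those run at once no matter how many
tasks exist. Raising mapred.min.split.size to 1 GB cuts the total to about
100 tasks, yet each node can still run 4 of them at a time. Only when the
split size is pushed high enough that the total task count falls below the
cluster's slot count (roughly 10 tasks at a 10 GB split size, say) does
per-node concurrency actually drop, and even then there is no guarantee the
scheduler spreads those tasks one per node.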

-----Original Message-----
From: mapreduce-dev-return-1367-michael.clements=disney....@hadoop.apache.org
[mailto:mapreduce-dev-return-1367-michael.clements=disney....@hadoop.apache.org]
On Behalf Of Allen Wittenauer
Sent: Friday, January 15, 2010 4:00 PM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: why one mapper process per block?




On 1/15/10 3:55 PM, "Erez Katz" <erez_k...@yahoo.com> wrote:
> What would it take to pipe ALL the blocks that are part of the input
> set, on a given node, to ONE mapper process?

Probably just setting mapred.min.split.size to a high enough value.
