I've been exploring the same question lately: capping the maximum number of simultaneous tasks per node.

A file split approach would work, though it's an indirect way of doing it. In many cases it would be cleaner and much easier to have a max task cap setting; for example, this could (and arguably should) be configurable as a Fair Scheduler pool setting. But there currently doesn't exist in Hadoop any simple means (that I know of) to set a max cap on tasks per machine for a specific job (or pool of jobs). You have the configured setting, which is applied globally, and if one or a few specific jobs need a different max, you're stuck. So the file split size approach, while indirect and more complex than a config setting, is the only one I know of.

The question actually has some subtlety, because there is the total number of tasks for the job, and the number that will run simultaneously. In some cases it's OK if there are a lot of tasks, so long as only one (or some other max cap) at a time runs per machine. In other cases, you need to limit the total number of tasks regardless of how many run simultaneously. The file split approach controls the total number of tasks for the job, which may impact (directly or indirectly) the number that run simultaneously. (A small sketch of the split-size approach is at the bottom of this message.)

-----Original Message-----
From: mapreduce-dev-return-1367-michael.clements=disney....@hadoop.apache.org
[mailto:mapreduce-dev-return-1367-michael.clements=disney....@hadoop.apache.org]
On Behalf Of Allen Wittenauer
Sent: Friday, January 15, 2010 4:00 PM
To: mapreduce-dev@hadoop.apache.org
Subject: Re: why one mapper process per block?

On 1/15/10 3:55 PM, "Erez Katz" <erez_k...@yahoo.com> wrote:

> What would it take to pipe ALL the blocks that are part of the input set, on
> a given node, to ONE mapper process?

Probably just setting mapred.min.split.size to a high enough value.
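To make the split-size idea concrete, here's a minimal sketch using the old-style org.apache.hadoop.mapred API. The class name, paths, and the 1 GB figure are placeholders I picked for illustration, not anything Hadoop prescribes:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class FewerMappersSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FewerMappersSketch.class);
    conf.setJobName("fewer-mappers-sketch");

    // Raise the minimum split size well above the HDFS block size so that
    // several blocks get combined into one split, i.e. one map task.
    // 1 GB is an arbitrary illustrative value.
    conf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);

    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // ... set mapper/reducer classes here as usual ...
    JobClient.runJob(conf);
  }
}

One caveat, if I remember right: plain FileInputFormat splits never cross file boundaries, so with many small input files you'd still get at least one map per file; CombineFileInputFormat is the usual workaround for that case.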
A file split approach would work, though it may be an indirect way of doing it. In many cases it would be cleaner and much easier to have a max task cap setting, for example this could/should be configurable in a Fair Scheduler pool setting. But there currently doesn't exist in Hadoop any simple means (that I know of) to set a max cap on tasks per machine, for a specific job (or pool of jobs). You have the configured setting, which is applied globally. If one or a few specific jobs need a different max, you're stuck. So the file split size approach, while indirect and more complex than a config setting, is the only one that I know of. The question actually has some subtlety because there is the total # of tasks for the job, and the # that will run simultaneously. In some cases, it's OK if there are a lot of tasks, so long as only 1 (or some other max cap) at a time runs per machine. In other cases, you need to limit the total # of tasks regardless of how many run simultaneously. The file split approach will control the total # of tasks for the job, which may impact (directly or indirectly) the # that run simultaneously. -----Original Message----- From: mapreduce-dev-return-1367-michael.clements=disney....@hadoop.apache.org [mailto:mapreduce-dev-return-1367-michael.clements=disney....@hadoop.apa che.org] On Behalf Of Allen Wittenauer Sent: Friday, January 15, 2010 4:00 PM To: mapreduce-dev@hadoop.apache.org Subject: Re: why one mapper process per block? On 1/15/10 3:55 PM, "Erez Katz" <erez_k...@yahoo.com> wrote: > What would it take to pipe ALL the blocks that are part of the input set, on > a given node, to ONE mapper process? Probably just setting mapred.min.split.size to a high enough value.