
The "LimitingTaskSlotUsage" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/LimitingTaskSlotUsage

--------------------------------------------------

New page:


There are many reasons why one might want to limit the number of concurrently running tasks.

* Job is consuming all task slots

The most common reason is that a given job is consuming all of the available
task slots, preventing other jobs from running. The easiest and best solution
is to switch from the default FIFO scheduler to another scheduler, such as the
FairShareScheduler or the CapacityScheduler. Both support limiting the number
of concurrent tasks per job.
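As a sketch, switching to the fair scheduler and capping a pool's concurrent tasks might look like the following. The pool name "analytics" and the numeric limits are purely illustrative, and exact property names can vary between Hadoop releases:

```xml
<!-- mapred-site.xml: replace the default FIFO scheduler with the fair scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- fair-scheduler.xml (allocations file): cap concurrent tasks for one pool.
     The pool name and limits below are examples only. -->
<allocations>
  <pool name="analytics">
    <maxMaps>20</maxMaps>
    <maxReduces>10</maxReduces>
  </pool>
</allocations>
```

Jobs submitted to that pool then share at most the configured number of map and reduce slots, leaving the rest of the cluster free for other jobs.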

* Job has taken too many reduce slots that are still waiting for maps to finish

There is a job tunable called mapred.reduce.slowstart.completed.maps that sets
the fraction of map tasks that must complete before reduce tasks are launched.
By default this is 5% (0.05), which is likely too low for most shared clusters.
Recommended values are closer to 80% (0.80) or higher. Note that for jobs with
a significant amount of intermediate data, setting this value higher will cause
reduce tasks to spend more of their time fetching that data before performing
useful work.
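For example, raising the threshold to the recommended 80% is a one-property change, either cluster-wide or per job (0.80 here is just the suggested starting point, not a universal value):

```xml
<!-- mapred-site.xml, or a per-job configuration override:
     wait until 80% of maps have completed before launching reduces -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```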

* Job is referencing an external, limited resource (such as a database)

In Hadoop terms, this is called a 'side-effect'.

One of the general assumptions of the framework is that tasks have no
side-effects. All tasks are expected to be restartable, and a side-effect
typically goes against the grain of this rule.

If a task absolutely must break the rules, there are a few things one can do:

** Deploy ZooKeeper and use it as a distributed lock or semaphore to track and 
limit how many tasks are running concurrently
** Use a scheduler with a maximum task-per-queue feature and submit the job to 
that queue

* Job consumes too much RAM/disk IO/etc on a given node

The CapacityScheduler in 0.21 has a feature whereby per-task memory
requirements determine how many slots a given task occupies. With careful use
of this feature, one can limit how many concurrent tasks a job runs on a given
node.
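A rough sketch of how this might be configured. The slot sizes and memory values below are illustrative, and the exact property names may vary by release:

```xml
<!-- Cluster side (mapred-site.xml): the memory represented by one map/reduce slot -->
<property>
  <name>mapred.cluster.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapred.cluster.reduce.memory.mb</name>
  <value>2048</value>
</property>

<!-- Job side: request 4096 MB per map task, so each map occupies two slots,
     halving the number of concurrent maps from this job on any node -->
<property>
  <name>mapred.job.map.memory.mb</name>
  <value>4096</value>
</property>
```

Because a task asking for twice the slot size consumes two slots, memory requests double as a coarse concurrency limit per node.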
