[ https://issues.apache.org/jira/browse/HADOOP-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644267#action_12644267 ]

Vivek Ratan commented on HADOOP-4035:
-------------------------------------

Since the issue of dealing with memory-intensive and badly behaved jobs has 
spanned more than one Jira, here's the latest summary on the overall proposal 
(following some offline discussions). 

The problem, as stated originally in HADOOP-3581, is that certain badly-behaved 
jobs end up using too much memory on a node and can bring down that node. We 
need to prevent this. A related requirement, as described in HADOOP-3759, is 
that the system respects different, and legitimate, memory requirements of 
different jobs.

There are two independent parts to solving this problem: monitoring and 
scheduling. Let's look at monitoring first. 

Monitoring
----------------

We want to ensure that the sum total of virtual memory (VM) usage by all tasks 
does not go over a limit (call this the _max-VM-per-node_ limit). That's really 
what brings down a machine. To detect badly behaved jobs, we want to associate 
a limit with each task (call this the _max-VM-per-task_ limit) such that a task 
is considered badly behaved if its VM usage goes over this limit. Think of the 
_max-VM-per-task_ limit as a kill limit. A TT monitors each task for its memory 
usage (this includes the memory used by the task's descendants). If a task's 
memory usage goes over its _max-VM-per-task_ limit, that task is killed. This 
monitoring has been implemented in HADOOP-3581. In addition, a TT monitors the 
total memory usage of all tasks it has spawned. If this value goes over the 
_max-VM-per-node_ limit, the TT needs to kill one or more tasks. As a simple 
heuristic, the TT can kill the tasks that started most recently, an approach 
suggested in HADOOP-4523. Tasks killed because they went over their own memory 
limit should be treated as failed, since they violated their contract. Tasks 
killed because the node's total memory usage went over the limit should be 
treated as killed, since it's not really their fault. 

How do we specify these limits? 
* *for _max-VM-per-node_*: HADOOP-3581 provides a config option, 
_mapred.tasktracker.tasks.maxmemory_ , which acts as the _max-VM-per-node_ 
limit. As per discussions in this Jira, and in HADOOP-4523, this needs to be 
enhanced.  _mapred.tasktracker.tasks.maxmemory_ should be replaced by 
_mapred.tasktracker.virtualmemory.reserved_, which indicates an offset (in 
MB?). _max-VM-per-node_ is then the total VM on the machine, minus this offset. 
How do we get the total VM on the machine? This can be obtained through the 
plugin interface that Owen proposed earlier. 
* *for _max-VM-per-task_*: HADOOP-3759 and HADOOP-4439 define a cluster-wide 
configuration, _mapred.task.default.maxmemory_, that describes the default 
maximum VM associated with each task. Rename it to _mapred.task.default.maxvm_ for 
consistency. This is the default _max-VM-per-task_ limit associated with a 
task. To support jobs that need higher or lower limits, this value can be 
overridden by individual jobs. A job can set a config value, 
_mapred.task.maxvm_, which overrides _mapred.task.default.maxvm_ for all tasks 
for that job. 
* Furthermore, as described earlier in this Jira, we want to prevent users from 
setting _mapred.task.maxvm_ to an arbitrarily high number and thus gaming the 
system. To do this, there should be a cluster-wide setting, 
_mapred.task.limit.maxvm_, that limits the value of _mapred.task.maxvm_. If 
_mapred.task.maxvm_ is set to a value higher than _mapred.task.limit.maxvm_, 
the job should not run. This check can either be done in the JT when a job is 
submitted, or a scheduler can fail the job if it detects this situation (see 
the sketch after this list). 
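
Here's a minimal sketch of how the per-task VM limit could be resolved and 
validated, using the property names proposed above. The helper class is 
hypothetical, and the real code would read these values from Hadoop's 
Configuration rather than a plain Properties object. 

{code:java}
import java.util.Properties;

// Hypothetical sketch: resolve a task's max-VM-per-task limit from the
// proposed configuration keys, and reject jobs that try to game the system.
public class TaskVmLimitSketch {

  /** Returns the max-VM-per-task limit (in MB) for a job's tasks. */
  static long resolveMaxVmPerTask(Properties conf) {
    long clusterDefault = Long.parseLong(
        conf.getProperty("mapred.task.default.maxvm", "-1"));
    long jobOverride = Long.parseLong(
        conf.getProperty("mapred.task.maxvm", "-1"));
    long upperLimit = Long.parseLong(
        conf.getProperty("mapred.task.limit.maxvm", "-1"));

    long effective = (jobOverride > 0) ? jobOverride : clusterDefault;

    // A job asking for more than the cluster-wide cap should not run;
    // this check could live in the JT at submission time or in a scheduler.
    if (upperLimit > 0 && effective > upperLimit) {
      throw new IllegalArgumentException(
          "mapred.task.maxvm (" + effective + " MB) exceeds "
          + "mapred.task.limit.maxvm (" + upperLimit + " MB)");
    }
    return effective;
  }
}
{code}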

Note that the monitoring process can be disabled if 
_mapred.tasktracker.virtualmemory.reserved_ is not present, or has some default 
negative value. 
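
For completeness, a small sketch of how the TT could derive _max-VM-per-node_, 
including the disable condition just mentioned. The total-VM figure would come 
from the plugin interface mentioned above; the method here is hypothetical. 

{code:java}
// Hypothetical sketch: derive max-VM-per-node from the machine's total VM
// and the proposed mapred.tasktracker.virtualmemory.reserved offset.
public class NodeVmLimitSketch {

  static final long DISABLED = -1L;

  /**
   * @param totalVmOnNodeMB total VM on the machine, as reported by the
   *                        resource plugin interface (placeholder here)
   * @param reservedVmMB    value of mapred.tasktracker.virtualmemory.reserved,
   *                        or a negative number if unset
   */
  static long maxVmPerNodeMB(long totalVmOnNodeMB, long reservedVmMB) {
    if (reservedVmMB < 0) {
      return DISABLED;   // monitoring is switched off entirely
    }
    return totalVmOnNodeMB - reservedVmMB;
  }
}
{code}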


Scheduling
-----------------

In order to prevent tasks from using too much memory, a scheduler can limit the 
number of tasks running on a node based on how much free memory is available 
and how much a task needs. The Capacity Scheduler will do this, though we 
cannot force all schedulers to support this feature. As per HADOOP-3759, TTs 
report, in each heartbeat, how much free VM they have (which is equal to 
_max-VM-per-node_ minus the sum of the _max-VM-per-task_ limits of all running 
tasks). The Capacity Scheduler needs to ensure that: 
# there is enough VM for a new task to run. It does this by comparing the 
task's requirement (its _max-VM-per-task_ limit) to the free VM available on 
the TT. 
# there is enough RAM available for a task, so that there is not a lot of page 
swapping and thrashing when tasks run. This is much harder to figure out, and 
it's not even clear what it means to have 'enough RAM available' for a task. A 
simple proposal, to get us started, is to treat a fraction of the 
_max-VM-per-task_ limit as the 'RAM limit' for a task. Call this the 
_max-RAM-per-task_ limit, and think of it as a scheduling limit. For a task to 
be scheduled, its _max-RAM-per-task_ limit should be less than the total RAM on 
a TT minus the sum of the _max-RAM-per-task_ limits of tasks running on the TT. 
This also implies that a TT should report its free RAM (the total RAM on the 
node minus the sum of the _max-RAM-per-task_ limits for each running task); a 
sketch of both computations follows this list. 
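
In other words, the per-heartbeat free-memory figures a TT reports could be 
computed roughly as follows (class and field names are illustrative only): 

{code:java}
import java.util.List;

// Hypothetical sketch of the free-memory figures a TT would report in its
// heartbeat, per the two checks above.
public class HeartbeatMemoryReportSketch {

  static class RunningTask {
    long maxVmPerTaskMB;   // the task's VM (kill) limit
    long maxRamPerTaskMB;  // the task's RAM (scheduling) limit
  }

  /** Free VM = max-VM-per-node minus the sum of running tasks' VM limits. */
  static long freeVmMB(long maxVmPerNodeMB, List<RunningTask> running) {
    long committed = 0;
    for (RunningTask t : running) {
      committed += t.maxVmPerTaskMB;
    }
    return maxVmPerNodeMB - committed;
  }

  /** Free RAM = RAM available to tasks minus the sum of running tasks' RAM limits. */
  static long freeRamMB(long ramForTasksMB, List<RunningTask> running) {
    long committed = 0;
    for (RunningTask t : running) {
      committed += t.maxRamPerTaskMB;
    }
    return ramForTasksMB - committed;
  }
}
{code}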

Just as with the handling of VM, we may want to use a part of the RAM for 
scheduling TT tasks, and not all of it. If so, we can introduce a config value, 
_mapred.tasktracker.ram.reserved_, which indicates an offset (in MB?). The 
amount of RAM available to the TT tasks is then the total RAM on the machine, 
minus this offset. How do we get the total RAM on the machine? By the same 
plugin interface through which we obtain total VM. 
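
The _ramForTasksMB_ figure used in the previous sketch would then be derived 
the same way as the VM case. Again a hypothetical helper; I'm assuming that if 
no offset is configured, all of the machine's RAM is considered. 

{code:java}
// Hypothetical sketch: RAM available to TT tasks is the machine's total RAM
// (from the same plugin interface that reports total VM) minus the proposed
// mapred.tasktracker.ram.reserved offset.
public class NodeRamLimitSketch {

  static long ramForTasksMB(long totalRamOnNodeMB, long reservedRamMB) {
    // If no offset is configured, all of the machine's RAM is considered.
    return (reservedRamMB > 0) ? totalRamOnNodeMB - reservedRamMB
                               : totalRamOnNodeMB;
  }
}
{code}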

How do we specify a task's _max-RAM-per-task_ limit? There is a system-wide 
default value, _mapred.capacity-scheduler.default.ramlimit_, expressed as a 
percentage. A task's default _max-RAM-per-task_ limit is equal to the task's 
_max-VM-per-task_ limit times this value. We may start by setting 
_mapred.capacity-scheduler.default.ramlimit_ to 50% or 33%. In order to let 
individual jobs override this default, a job can set a config value, 
_mapred.task.maxram_, expressed in MB, which then becomes the task's 
_max-RAM-per-task_ limit. Furthermore, as with VM settings, we want to prevent 
users from setting _mapred.task.maxram_ to an arbitrarily high number and thus 
gaming the system. To do this, there should be a cluster-wide setting, 
_mapred.task.limit.maxram_, that limits the value of _mapred.task.maxram_. If 
_mapred.task.maxram_ is set to a value higher than _mapred.task.limit.maxram_, 
the job should not run. Either this check can be done in the JT when a job is 
submitted, or a scheduler can fail the job if it detects this situation. 
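
A minimal sketch of how a task's _max-RAM-per-task_ limit could be resolved and 
validated, using the property names proposed above. As before, the helper is 
hypothetical, the real code would use Hadoop's Configuration, and the 50% 
default is just the starting value suggested above. 

{code:java}
import java.util.Properties;

// Hypothetical sketch: resolve a task's max-RAM-per-task limit from the
// proposed configuration keys.
public class TaskRamLimitSketch {

  static long resolveMaxRamPerTask(Properties conf, long maxVmPerTaskMB) {
    // Default: a fraction of the task's VM limit, e.g. 50 or 33 percent.
    int defaultRamPercent = Integer.parseInt(
        conf.getProperty("mapred.capacity-scheduler.default.ramlimit", "50"));
    long jobOverride = Long.parseLong(
        conf.getProperty("mapred.task.maxram", "-1"));
    long upperLimit = Long.parseLong(
        conf.getProperty("mapred.task.limit.maxram", "-1"));

    long effective = (jobOverride > 0)
        ? jobOverride
        : maxVmPerTaskMB * defaultRamPercent / 100;

    // As with VM, a job asking for more RAM than the cluster-wide cap
    // should not be allowed to run.
    if (upperLimit > 0 && effective > upperLimit) {
      throw new IllegalArgumentException(
          "mapred.task.maxram (" + effective + " MB) exceeds "
          + "mapred.task.limit.maxram (" + upperLimit + " MB)");
    }
    return effective;
  }
}
{code}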

The Capacity Scheduler, when it picks a task to run, will check if both the 
task's RAM limit and VM limit can be satisfied. If so, the task is given to the 
TT. If not, nothing is given to the TT (i.e., the cluster blocks till at least 
one TT has enough memory). We will not block forever because we limit what the 
task can ask for, and these limits should be set lower than the RAM and VM on 
each TT. In order to tax users for their jobs' requirements, we may eventually 
charge them for the values they set per task, but for now, there is no penalty 
associated with the per-task memory values (_mapred.task.maxvm_, 
_mapred.task.maxram_) a user sets for a job. 
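
Putting the two checks together, the per-heartbeat assignment decision would 
look roughly like this (method and class names are illustrative): 

{code:java}
// Hypothetical sketch of the per-heartbeat assignment decision: a task is
// handed to the TT only if both its VM and RAM limits fit in what the TT
// reported as free; otherwise the TT gets nothing this heartbeat.
public class CapacitySchedulerMemoryCheckSketch {

  static class TaskRequest {
    long maxVmPerTaskMB;
    long maxRamPerTaskMB;
  }

  static boolean canAssign(TaskRequest task, long freeVmMB, long freeRamMB) {
    boolean vmFits = task.maxVmPerTaskMB <= freeVmMB;
    boolean ramFits = task.maxRamPerTaskMB <= freeRamMB;
    return vmFits && ramFits;
  }
}
{code}

If canAssign() returns false for every candidate task, the TT simply gets 
nothing in that heartbeat, which is the blocking behaviour described above. 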


Open issues
-------------------

Based on the writeup above, I'm summarizing a few of the open issues (mostly 
minor): 
# Should the memory-related config values be expressed in MB or GB or KB or 
just bytes? MB sounds good to me. 
# If a job's specified VM or RAM task limit is higher than the max limit, that 
job shouldn't be allowed to run. Should the JT reject the job when it is 
submitted, or should the scheduler do it, by failing the job? The argument for 
the former is that these limits apply to all schedulers; but then again, they 
are scheduling-based limits, so maybe they should be enforced in each of the 
schedulers. In the latter case, if a scheduler does not support scheduling 
based on memory limits, it can simply ignore these settings and run the job. So 
the latter option seems better. 
# Should the Capacity Scheduler use the entire RAM of a TT when making a 
scheduling decision, or an offset? Given that the RAM fractions are not very 
precise (they're based on fractions of the VM), an offset doesn't make much of 
a difference (you could tweak _mapred.capacity-scheduler.default.ramlimit_ to 
achieve what the offset would), and adds an extra config value. At the same 
time, part of the RAM is blocked for non-Hadoop stuff, and an offset does make 
things symmetrical. 


> Modify the capacity scheduler (HADOOP-3445) to schedule tasks based on memory 
> requirements and task trackers free memory
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4035
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4035
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>            Reporter: Hemanth Yamijala
>            Assignee: Vinod K V
>            Priority: Blocker
>             Fix For: 0.20.0
>
>         Attachments: 4035.1.patch, HADOOP-4035-20080918.1.txt, 
> HADOOP-4035-20081006.1.txt, HADOOP-4035-20081006.txt, HADOOP-4035-20081008.txt
>
>
> HADOOP-3759 introduced configuration variables that can be used to specify 
> memory requirements for jobs, and also modified the tasktrackers to report 
> their free memory. The capacity scheduler in HADOOP-3445 should schedule 
> tasks based on these parameters. A task that is scheduled on a TT that uses 
> more than the default amount of memory per slot can be viewed as effectively 
> using more than one slot, as it would decrease the amount of free memory on 
> the TT by more than the default amount while it runs. The scheduler should 
> make the used capacity account for this additional usage while enforcing 
> limits, etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
