[ 
https://issues.apache.org/jira/browse/HADOOP-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689061#action_12689061
 ] 

Vinod K V commented on HADOOP-5568:
-----------------------------------

The primary observation while testing TaskMemoryManager is that it is not able 
to prevent nodes from going down when rogue tasks start consuming memory. It 
currently does the following:
 - It monitors the memory usage of each task (the task JVM and its descendant 
processes), and makes sure that the task is failed if it goes beyond its 
memory requirements (specified via mapred.task.maxvmem).
 - Further, it also monitors the memory usage of all tasks running on a TT and 
makes sure that cumulative memory usage doesn't cross a specific limit (Total 
TT Vmem less mapred.tasktracker.vmem.reserved) by killing the least-progress 
tasks to bring down the memory usage.
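
The two checks described above can be sketched roughly as follows. This is 
only an illustration of the logic as described in this comment, not the actual 
TaskMemoryManager code; the Task class, its fields, and the function name are 
hypothetical, and memory values are plain integers in arbitrary units.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical stand-in for a task tracked on one TT.
    task_id: str
    vmem_used: int   # usage of the task JVM plus its descendant processes
    vmem_limit: int  # per-task limit, as given by mapred.task.maxvmem
    progress: float  # 0.0 to 1.0


def select_tasks_to_kill(tasks, total_vmem, reserved_vmem):
    """Return (over_limit, killed_for_cumulative) lists of task ids."""
    # Check 1: every task that exceeds its own limit is failed.
    over_limit = [t.task_id for t in tasks if t.vmem_used > t.vmem_limit]

    # Check 2: among the remaining tasks, keep cumulative usage under
    # (total TT Vmem - reserved) by killing least-progress tasks first.
    survivors = [t for t in tasks if t.vmem_used <= t.vmem_limit]
    usable = total_vmem - reserved_vmem
    cumulative = sum(t.vmem_used for t in survivors)
    killed = []
    for t in sorted(survivors, key=lambda task: task.progress):
        if cumulative <= usable:
            break
        killed.append(t.task_id)
        cumulative -= t.vmem_used
    return over_limit, killed
```

Note that both checks act only after the memory has already been consumed, 
which is why a fast-growing task can outrun them.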

The per-task monitoring works fine for tasks whose memory grows at a moderate 
rate of up to around 100 MB/sec. There are problems with the cumulative-usage 
monitoring:
 - The limit mapred.task.limit.maxvmem was originally meant to prevent jobs 
from asking for too much memory. If a single task asks for memory nearing the 
total usable Vmem on the TT, we don't prevent the task from running and, as of 
now, just log at warn level in the TT if it crosses mapred.task.limit.maxvmem. 
This is very problematic without any support for memory-based scheduling, as 
tasks can potentially bring down nodes, and we have seen instances of this.
 - Even if the tasks are within limits, as mapred.task.limit.maxvmem is not 
really enforced, cumulative usage near the total usable Vmem on the TT brings 
down the node, and we have seen instances of this too.
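
The warn-only behaviour described above amounts to something like the sketch 
below. It assumes the value compared against mapred.task.limit.maxvmem is the 
task's requested memory; the function and logger names are hypothetical and 
this is not the TT's actual code.

```python
import logging

log = logging.getLogger("tasktracker_sketch")  # hypothetical logger name


def admit_task(requested_vmem, limit_maxvmem):
    """Mimic the described behaviour: warn, but never refuse the task."""
    if requested_vmem > limit_maxvmem:
        # Only a warning is logged; nothing stops the task from running.
        log.warning(
            "task requests %d, above mapred.task.limit.maxvmem=%d",
            requested_vmem,
            limit_maxvmem,
        )
    # The task is admitted regardless of the comparison above.
    return True
```

Because the function returns True on both paths, a task asking for nearly all 
usable Vmem still runs, which matches the failures reported here.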

> TaskMemoryManager not enforcing memory limits in the presence of rogue tasks
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-5568
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5568
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Vinod K V
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
