[ 
https://issues.apache.org/jira/browse/TEZ-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated TEZ-2872:
----------------------------
    Attachment: TEZ-2872.gettask-governor.patch

Here's a crude patch that lets the client configure the maximum number of 
tasks the AM will launch per second.  If handing out another task would 
exceed that limit, the AM simply returns null from the getTask() call and 
expects the container to poll again later.
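
For readers without the attachment, the idea can be sketched roughly as 
below.  This is not the code in the patch, only a minimal illustration 
assuming a fixed one-second window.  The class GetTaskGovernor, its 
constructor parameters, and the String stand-in for the ContainerTask 
response are all hypothetical names made up for the example.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of a per-second task launch governor.
 * Not the code from TEZ-2872.gettask-governor.patch.
 */
public class GetTaskGovernor {
    private final int maxTasksPerSecond;     // assumed client-configured limit
    private final LongSupplier clockMillis;  // injectable clock, eases testing
    private long windowStartMillis;
    private int launchedInWindow;

    public GetTaskGovernor(int maxTasksPerSecond, LongSupplier clockMillis) {
        this.maxTasksPerSecond = maxTasksPerSecond;
        this.clockMillis = clockMillis;
        this.windowStartMillis = clockMillis.getAsLong();
    }

    /** Returns true if another task may be launched in the current window. */
    public synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        if (now - windowStartMillis >= 1000L) {
            // A full second has elapsed: start a fresh window.
            windowStartMillis = now;
            launchedInWindow = 0;
        }
        if (launchedInWindow >= maxTasksPerSecond) {
            return false;
        }
        launchedInWindow++;
        return true;
    }

    /**
     * Stand-in for the getTask() handler: hands out the pending task, or
     * returns null when the limit is reached so the caller polls again.
     */
    public String getTask(String pendingTask) {
        return tryAcquire() ? pendingTask : null;
    }

    public static void main(String[] args) {
        long[] now = {0L};  // fake clock, in milliseconds
        GetTaskGovernor governor = new GetTaskGovernor(2, () -> now[0]);
        System.out.println(governor.getTask("task-1")); // task-1
        System.out.println(governor.getTask("task-2")); // task-2
        System.out.println(governor.getTask("task-3")); // null (throttled)
        now[0] = 1000L;                                 // one second later
        System.out.println(governor.getTask("task-3")); // task-3
    }
}
```

A fixed window is the simplest thing that caps the response rate; a token 
bucket would smooth out bursts at window boundaries at the cost of a bit 
more bookkeeping.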


> Tez AM can be overwhelmed by TezTaskUmbilicalProtocol.getTask responses
> -----------------------------------------------------------------------
>
>                 Key: TEZ-2872
>                 URL: https://issues.apache.org/jira/browse/TEZ-2872
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jason Lowe
>         Attachments: TEZ-2872.gettask-governor.patch
>
>
> When a large job with a large user payload runs on a large cluster, the 
> AM can end up hitting OOM conditions.  For example, Pig-on-Tez can require 
> a significant user payload (approaching 1MB) for vertices, inputs, and 
> outputs in the DAG.  This can make the ContainerTask response rather large 
> per task, which can lead to a situation where the AM generates responses 
> faster than the network interface can send them.  The unsent responses 
> pile up in AM memory, so if enough containers are asking for tasks this 
> leads to an OOM condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
