[ 
https://issues.apache.org/jira/browse/TEZ-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949455#comment-14949455
 ] 

Bikas Saha commented on TEZ-2872:
---------------------------------

How about the following alternative solutions?
1) https://issues.apache.org/jira/browse/TEZ-1371 / 
https://issues.apache.org/jira/browse/TEZ-754
2) https://issues.apache.org/jira/browse/TEZ-307
Essentially, both try to send much less data to the container, because 
payloads are identical across tasks of the same vertex (and usually partially 
shared across tasks of different vertices too). 
Approach 1) would be similar to sending a CRC/identifier to TezChild instead 
of the actual payload once the payload has been sent to it over RPC. This is 
the more dynamic option.
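
A minimal sketch of the idea in 1), not Tez code: the sender remembers which 
payload identifiers each container already holds and sends only the identifier 
on later requests. The class names, the dict-shaped response, and the use of 
SHA-256 as the identifier are all assumptions made for illustration.

```python
import hashlib


class PayloadSender:
    """AM side (hypothetical): tracks which payload digests each container has."""

    def __init__(self):
        self._sent = {}  # container id -> set of digests already delivered

    def build_response(self, container_id, payload):
        digest = hashlib.sha256(payload).hexdigest()
        seen = self._sent.setdefault(container_id, set())
        if digest in seen:
            # Container already has the bytes: send the identifier only.
            return {"digest": digest, "payload": None}
        seen.add(digest)
        return {"digest": digest, "payload": payload}


class PayloadCache:
    """Child side (hypothetical): keeps payloads received so far, keyed by digest."""

    def __init__(self):
        self._by_digest = {}

    def resolve(self, response):
        if response["payload"] is not None:
            self._by_digest[response["digest"]] = response["payload"]
        return self._by_digest[response["digest"]]
```

The sketch assumes the container keeps its cache for as long as the sender 
believes it does; a real protocol would need a way to recover when the child 
no longer has the payload.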
Approach 2) would identify upfront that payloads are large, convert them to 
local resources, and localize them for the task containers. Payloads would 
then be replaced with LR identifiers, and task contexts would reference the 
identifiers instead of the payloads.
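
A rough sketch of the idea in 2), again not Tez code: payloads above a size 
threshold are written to a staging directory and replaced by a reference, and 
the task side reads the file from wherever it was localized. The threshold 
value, the function names, and the file naming scheme are hypothetical, and 
localization itself is not modelled here.

```python
import hashlib
import os
import tempfile

# Hypothetical cut-off; the right value would depend on the deployment.
LARGE_PAYLOAD_BYTES = 64 * 1024


def externalize(payload, staging_dir, threshold=LARGE_PAYLOAD_BYTES):
    """Return ("inline", bytes) for small payloads, ("resource", name) for large ones."""
    if len(payload) < threshold:
        return ("inline", payload)
    name = "payload-" + hashlib.sha256(payload).hexdigest()
    path = os.path.join(staging_dir, name)
    if not os.path.exists(path):  # identical payloads share one file
        with open(path, "wb") as f:
            f.write(payload)
    return ("resource", name)


def load(ref, local_dir):
    """Task side: resolve a reference back to the payload bytes."""
    kind, value = ref
    if kind == "inline":
        return value
    with open(os.path.join(local_dir, value), "rb") as f:
        return f.read()
```

Because the file name is derived from the content, tasks of the same vertex 
that share a payload would all reference the same resource.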


> Tez AM can be overwhelmed by TezTaskUmbilicalProtocol.getTask responses
> -----------------------------------------------------------------------
>
>                 Key: TEZ-2872
>                 URL: https://issues.apache.org/jira/browse/TEZ-2872
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jason Lowe
>
> When a large job runs on a large cluster with a large user payload then the 
> AM can end up hitting OOM conditions.  For example, Pig-on-Tez can require a 
> significant user payload (approaching 1MB) for vertices, inputs, and outputs 
> in the DAG.  This can cause the ContainerTask response to be rather large per 
> task, which can lead to a situation where the AM is generating output faster 
> than the network interface can process it.  If there are enough containers 
> asking for tasks then this leads to an OOM condition.



