[ 
https://issues.apache.org/jira/browse/TEZ-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panagiotis Garefalakis updated TEZ-4183:
----------------------------------------
    Description: 
Time-based batching can put a lot of pressure on the AM's memory, as the 
failedEvents hashmap can grow fast:
https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951

To reduce AM pressure we can: 1) batch fetch failure events and send them 
periodically (every BATCH_WAIT), and 2) if the number of disk errors exceeds a 
threshold, send the message to the AM immediately (instead of waiting).

  was:
Fetcher currently sends failure events to AM as soon as they are discovered:
https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L930

To reduce AM pressure we can: 1) batch fetch failure events and send them 
periodically (every BATCH_WAIT), and 2) if the number of disk errors exceeds a 
threshold, send the message to the AM immediately (instead of waiting).


> Time- and threshold-batched FetchFailure event propagation to AM
> ----------------------------------------------------------------
>
>                 Key: TEZ-4183
>                 URL: https://issues.apache.org/jira/browse/TEZ-4183
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>         Attachments: TEZ-4183.01.patch, TEZ-4183.02.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Time-based batching can put a lot of pressure on the AM's memory, as the 
> failedEvents hashmap can grow fast:
> https://github.com/apache/tez/blob/354c2a4177fe8c3cf6b8a4c6009d4068a19d81f1/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L951
> To reduce AM pressure we can: 1) batch fetch failure events and send them 
> periodically (every BATCH_WAIT), and 2) if the number of disk errors exceeds a 
> threshold, send the message to the AM immediately (instead of waiting).
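
For illustration only, the two rules in the description (periodic flush every 
BATCH_WAIT, immediate flush once disk errors exceed a threshold) could be 
sketched roughly as below. This is not the code from the attached patches: 
the class FetchFailureBatcher, its fields, and the sendToAm callback are 
hypothetical names, events are reduced to plain strings, and the caller is 
assumed to supply the clock and to invoke tick() from some periodic thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of time- and threshold-based batching; not Tez code. */
final class FetchFailureBatcher {
    private final long batchWaitMillis;            // stands in for BATCH_WAIT
    private final int diskErrorThreshold;          // assumed disk-error threshold
    private final Consumer<List<String>> sendToAm; // assumed hook that ships a batch to the AM
    private final List<String> pending = new ArrayList<>();
    private int diskErrors = 0;
    private long lastFlushMillis;

    FetchFailureBatcher(long batchWaitMillis, int diskErrorThreshold,
                        long nowMillis, Consumer<List<String>> sendToAm) {
        this.batchWaitMillis = batchWaitMillis;
        this.diskErrorThreshold = diskErrorThreshold;
        this.lastFlushMillis = nowMillis;
        this.sendToAm = sendToAm;
    }

    /** Queue a failure; flush at once if disk errors exceed the threshold. */
    synchronized void onFetchFailure(String event, boolean isDiskError, long nowMillis) {
        pending.add(event);
        if (isDiskError) {
            diskErrors++;
        }
        if (diskErrors > diskErrorThreshold) {
            flush(nowMillis);
        }
    }

    /** Called periodically; flushes queued events once BATCH_WAIT has elapsed. */
    synchronized void tick(long nowMillis) {
        if (!pending.isEmpty() && nowMillis - lastFlushMillis >= batchWaitMillis) {
            flush(nowMillis);
        }
    }

    /** Sends a copy of the queued events as one batch and resets the counters. */
    private void flush(long nowMillis) {
        if (!pending.isEmpty()) {
            sendToAm.accept(new ArrayList<>(pending));
        }
        pending.clear();
        diskErrors = 0;
        lastFlushMillis = nowMillis;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a 
real implementation would more likely read a monotonic clock and drive tick() 
from a scheduled task.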



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
