[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116156#comment-15116156
 ] 

Bikas Saha commented on TEZ-3072:
---------------------------------

A short-term fix could be to disable the rerun of completed tasks while 
still blacklisting the node, so that no more work is scheduled there. A 
longer-term effort would be to put machines on probation for a few cycles 
before giving up on them, which might help prevent cliffs like this, 
especially those caused by temporary glitches.
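
To make the probation idea concrete, here is a minimal sketch of how such a 
tracker might look. It is not Tez code: the class NodeProbationTracker, its 
methods, the threshold, and the cycle count are all hypothetical names and 
values chosen for illustration. It also assumes that a node on probation 
receives no new work, that a further failure during probation means giving up 
on the node, and that completed tasks are only considered for rerun once the 
node is actually blacklisted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Tez code base.
final class NodeProbationTracker {
    enum State { HEALTHY, PROBATION, BLACKLISTED }

    private static final class NodeInfo {
        State state = State.HEALTHY;
        int failures = 0;
        int cyclesLeft = 0;
    }

    private final int failureThreshold; // failures before probation starts
    private final int probationCycles;  // cycles a node stays on probation
    private final Map<String, NodeInfo> nodes = new HashMap<>();

    NodeProbationTracker(int failureThreshold, int probationCycles) {
        this.failureThreshold = failureThreshold;
        this.probationCycles = probationCycles;
    }

    // Record a task failure on the given node and update its state.
    void onTaskFailure(String node) {
        NodeInfo info = nodes.computeIfAbsent(node, k -> new NodeInfo());
        switch (info.state) {
            case HEALTHY:
                info.failures++;
                if (info.failures >= failureThreshold) {
                    info.state = State.PROBATION;
                    info.cyclesLeft = probationCycles;
                }
                break;
            case PROBATION:
                // Assumption: a failure while on probation means giving up.
                info.state = State.BLACKLISTED;
                break;
            default:
                break; // already blacklisted, nothing to do
        }
    }

    // Advance one cycle; nodes that finish probation become healthy again.
    void onCycle() {
        for (NodeInfo info : nodes.values()) {
            if (info.state == State.PROBATION && --info.cyclesLeft <= 0) {
                info.state = State.HEALTHY;
                info.failures = 0;
            }
        }
    }

    State stateOf(String node) {
        NodeInfo info = nodes.get(node);
        return info == null ? State.HEALTHY : info.state;
    }

    // New work goes only to healthy nodes.
    boolean canSchedule(String node) {
        return stateOf(node) == State.HEALTHY;
    }

    // Assumption: completed tasks are never rerun merely because of probation.
    boolean shouldRerunCompletedTasks(String node) {
        return stateOf(node) == State.BLACKLISTED;
    }
}
```

With something like this, a temporary glitch would cost a node a few 
scheduling cycles rather than triggering a rerun of everything that had 
already completed there.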

> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>
> Recently a user ran a job with many vertices, and there was a bug in the 
> user's code that caused a problem in one of the trailing vertices in the 
> task.  On some nodes enough tasks failed that the AM thought it needed to 
> blacklist those nodes.  That blacklisting then caused many completed vertices 
> to re-run because it thought it needed to re-execute the non-leaf tasks that 
> had completed on those nodes.  This wasted a lot of cluster resources and job 
> time for no benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
