[
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116156#comment-15116156
]
Bikas Saha commented on TEZ-3072:
---------------------------------
A short-term fix could be to disable the rerun of completed tasks (while
continuing to blacklist the node to avoid scheduling more work there). A
longer-term effort to put machines on probation for a few cycles before
giving up on them might help prevent cliffs like this, especially those caused
by temporary glitches.
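
To make the two ideas above concrete, here is a rough sketch of how they
could fit together. It is not Tez code: NodeProbationTracker, its methods and
its thresholds are all hypothetical names invented for illustration, and the
notion of a "cycle" (one scheduling/heartbeat round) is an assumption. A node
that fails too often goes on probation first; it is only blacklisted if it
keeps failing for several consecutive cycles, and completed tasks are rerun
only if a flag explicitly allows it.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch only; none of these names exist in Tez. */
class NodeProbationTracker {
    enum State { HEALTHY, PROBATION, BLACKLISTED }

    private final int failuresBeforeProbation; // assumed threshold
    private final int probationCycles;         // assumed grace period, in cycles
    private final boolean rerunCompletedTasks; // false models the short-term fix

    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Integer> badCycles = new HashMap<>();
    private final Map<String, State> states = new HashMap<>();

    NodeProbationTracker(int failuresBeforeProbation, int probationCycles,
                         boolean rerunCompletedTasks) {
        this.failuresBeforeProbation = failuresBeforeProbation;
        this.probationCycles = probationCycles;
        this.rerunCompletedTasks = rerunCompletedTasks;
    }

    State stateOf(String node) {
        return states.getOrDefault(node, State.HEALTHY);
    }

    /** Record one task failure; enough failures put a healthy node on probation. */
    void onTaskFailure(String node) {
        int count = failures.merge(node, 1, Integer::sum);
        if (stateOf(node) == State.HEALTHY && count >= failuresBeforeProbation) {
            states.put(node, State.PROBATION);
            badCycles.put(node, 0);
        }
    }

    /** Called once per cycle for a node; only nodes on probation change state. */
    void onCycleEnd(String node, boolean failedThisCycle) {
        if (stateOf(node) != State.PROBATION) {
            return;
        }
        if (!failedThisCycle) {
            // A clean cycle suggests a temporary glitch: forgive the node.
            states.put(node, State.HEALTHY);
            failures.remove(node);
            badCycles.remove(node);
            return;
        }
        int cycles = badCycles.merge(node, 1, Integer::sum);
        if (cycles >= probationCycles) {
            states.put(node, State.BLACKLISTED); // give up on the node
        }
    }

    /** New work is withheld only from blacklisted nodes. */
    boolean canSchedule(String node) {
        return stateOf(node) != State.BLACKLISTED;
    }

    /** Completed tasks are rerun only if the node is blacklisted AND the flag allows it. */
    boolean shouldRerunCompleted(String node) {
        return rerunCompletedTasks && stateOf(node) == State.BLACKLISTED;
    }
}
```

With rerunCompletedTasks set to false, a blacklisted node stops receiving new
work but the output of tasks that already completed there is left alone,
which is the short-term behavior suggested above.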
> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
> Key: TEZ-3072
> URL: https://issues.apache.org/jira/browse/TEZ-3072
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
>
> Recently a user ran a job with many vertices, and there was a bug in the
> user's code that caused a problem in one of the trailing vertices in the
> task. On some nodes enough tasks failed that the AM thought it needed to
> blacklist those nodes. That blacklisting then caused many completed vertices
> to re-run because the AM thought it needed to re-execute the non-leaf tasks
> that had completed on those nodes. This wasted a lot of cluster resources and
> job time for no benefit.