[
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115814#comment-15115814
]
Jason Lowe commented on TEZ-3072:
---------------------------------
We also have problems with temporary fetch failures on a node causing all
completed tasks from that node to re-run. In many ways the blacklisting logic
is causing more problems than it is solving, at least with respect to
fetch-failure related processing. It would be nice if we could configure
blacklisting to ignore node effects involving shuffle (e.g., fetch failures are
not reported to the blacklisting logic, and blacklisted nodes don't cause
completed tasks to re-run).
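
The requested behavior could be modeled roughly as below. This is only an
illustrative sketch, not Tez code: `FailureKind`, `BlacklistPolicy`,
`NodeTracker`, and both boolean switches are hypothetical names invented for
the example and do not correspond to existing Tez classes or configuration
keys. It assumes a simple per-node failure counter with a fixed threshold.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureKind(Enum):
    TASK = "task"    # the task attempt itself failed on the node
    FETCH = "fetch"  # a downstream task could not fetch this node's shuffle output


@dataclass
class BlacklistPolicy:
    # Hypothetical switches; these are NOT real Tez configuration keys.
    max_failures_per_node: int = 3
    count_fetch_failures: bool = False          # report fetch failures to blacklisting?
    rerun_completed_on_blacklist: bool = False  # re-run completed tasks on blacklist?


@dataclass
class NodeTracker:
    policy: BlacklistPolicy
    failures: dict = field(default_factory=dict)   # node -> counted failures
    completed: dict = field(default_factory=dict)  # node -> completed non-leaf task ids
    blacklisted: set = field(default_factory=set)

    def task_completed(self, node, task_id):
        self.completed.setdefault(node, []).append(task_id)

    def report_failure(self, node, kind):
        """Record a failure and return the completed tasks that must re-run."""
        if kind is FailureKind.FETCH and not self.policy.count_fetch_failures:
            return []  # fetch failures never reach the blacklisting logic
        self.failures[node] = self.failures.get(node, 0) + 1
        already = node in self.blacklisted
        if already or self.failures[node] < self.policy.max_failures_per_node:
            return []
        self.blacklisted.add(node)
        if self.policy.rerun_completed_on_blacklist:
            return list(self.completed.get(node, []))
        return []  # node is blacklisted, but its completed output is kept
```

With both switches off, repeated fetch failures leave the node usable, and a
node blacklisted for task failures keeps its completed output. Turning both
switches on reproduces the behavior described in this issue.
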
> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
> Key: TEZ-3072
> URL: https://issues.apache.org/jira/browse/TEZ-3072
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Jason Lowe
>
> Recently a user ran a job with many vertices, and there was a bug in the
> user's code that caused a problem in one of the trailing vertices in the
> job. On some nodes enough tasks failed that the AM thought it needed to
> blacklist those nodes. That blacklisting then caused many completed vertices
> to re-run because the AM thought it needed to re-execute the non-leaf tasks that
> had completed on those nodes. This wasted a lot of cluster resources and job
> time for no benefit.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)