[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147747#comment-15147747
 ] 

Bikas Saha commented on TEZ-3072:
---------------------------------

This may be partly relevant when we handle node decommissioning, e.g. in that 
case we could send inputfailed to consumers. However, that's a discussion for 
the future.

I am +1 for the changes in TaskAttemptImpl. Blacklisting a node should not 
arbitrarily rerun all completed attempts on that node, because downstream 
consumers may have already finished processing their output. We should probably 
rename the config to signify this aspect, e.g. bad_node_rerun_attempts, and 
give it a default of false.
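
A minimal sketch of how such a flag might gate the rerun decision, assuming the 
suggested name bad_node_rerun_attempts and a default of false. The class, the 
method, and the Properties-based lookup are hypothetical illustrations and are 
not taken from the Tez code base or the attached patch.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Tez.
final class BadNodeRerunPolicy {
    // Config name as suggested in this comment; the final key may differ.
    static final String KEY = "bad_node_rerun_attempts";
    static final boolean DEFAULT = false;

    // Returns true only when the flag is enabled and a succeeded attempt
    // ran on a node that has since been blacklisted.
    static boolean shouldRerun(Properties conf,
                               boolean attemptSucceeded,
                               boolean nodeBlacklisted) {
        boolean enabled = Boolean.parseBoolean(
            conf.getProperty(KEY, String.valueOf(DEFAULT)));
        return enabled && attemptSucceeded && nodeBlacklisted;
    }
}
```

With the default left at false, a completed attempt on a blacklisted node is 
kept, so finished upstream work is not repeated unless someone opts in.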

However, I would like to be cautious about the changes in TaskImpl. If a task 
has been marked as failed retroactively, it implies that consumers have 
reported enough errors against it, and the attempt will then be retried. So 
informing the node about this seems the right thing to do. It is 
likely that a number of such errors may indicate issues with that node, some of 
which may be temporary. With TEZ-3075, which would temporarily decommission the 
nodes, we should be able to handle the temporary cases. But getting the 
information about failures (including fetch failures) is important to make the 
decisions at the node level. Hence, IMO we should not make the change proposed 
in TaskImpl. If such a change is needed, then it could be made in 
AMNode/AMNodeTracker logic that handles AMNodeEventTaskAttemptEnded. There we 
could filter attempt failures by type and ignore fetch failures (based on a 
separate config). Or we could postpone that change in favor of TEZ-3075.
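
A rough sketch of the filtering idea above, assuming the node-level logic sees 
a failure type for each ended attempt. The class, the enum values, and the 
ignoreFetchFailures flag are all hypothetical names chosen for illustration; 
they do not reflect the actual AMNode/AMNodeTracker code or event fields.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for node-level failure accounting; not Tez code.
final class NodeFailureFilter {
    // Assumed failure categories, invented for this sketch.
    enum FailureType { TASK_ERROR, FETCH_FAILURE, CONTAINER_LOST }

    private final Set<FailureType> ignored;
    private int countedFailures = 0;

    // ignoreFetchFailures models the "separate config" mentioned above.
    NodeFailureFilter(boolean ignoreFetchFailures) {
        this.ignored = ignoreFetchFailures
            ? EnumSet.of(FailureType.FETCH_FAILURE)
            : EnumSet.noneOf(FailureType.class);
    }

    // Called for every failed attempt on the node. The node always receives
    // the event; only the blacklisting count is filtered by type.
    // Returns true if the failure was counted against the node.
    boolean onAttemptFailed(FailureType type) {
        if (ignored.contains(type)) {
            return false;
        }
        countedFailures++;
        return true;
    }

    int getCountedFailures() {
        return countedFailures;
    }
}
```

The point of placing the filter here rather than in TaskImpl is that the node 
logic still sees every failure and can decide what to do with it, instead of 
never being told.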

Separately, AMNodeEventTaskAttemptEnded seems to be sent from both TaskScheduler 
and TaskImpl, whereas it could be sent from a single source in TaskAttemptImpl. 
With two senders, the current approach is open to getting out of sync.

> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: TEZ-3072.001.patch
>
>
> Recently a user ran a job with many vertices, and there was a bug in the 
> user's code that caused a problem in one of the trailing vertices in the 
> task.  On some nodes enough tasks failed that the AM thought it needed to 
> blacklist those nodes.  That blacklisting then caused many completed vertices 
> to re-run because it thought it needed to re-execute the non-leaf tasks that 
> had completed on those nodes.  This wasted a lot of cluster resources and job 
> time for no benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
