[ 
https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117350#comment-15117350
 ] 

Jason Lowe commented on TEZ-3072:
---------------------------------

In this particular case even blacklisting the node was the wrong thing to do 
because the node was irrelevant to the task failures.

I noticed that the code treats a node removed by YARN and a node blacklisted 
due to attempt failures identically.  I could see that being problematic in 
practice, because a node that is failing tasks could still serve up shuffle 
data just fine.  Re-running completed tasks only helps if fetching their 
shuffle output would actually fail.  I suspect that in most cases re-running 
completed tasks to guard against the theoretical possibility of shuffle 
problems (without even trying to verify that the problem exists) makes the job 
slower than simply assuming the shuffle will work and letting normal fetch 
failure handling take care of any problem.  Yes, there are going to be 
pathological cases where predictive re-execution would drastically speed up 
the job, but we're seeing plenty of cases where this preemptive strike against 
potential shuffle problems is causing much more harm.  I saw another case of 
this yesterday where a job re-ran dozens of tasks from upstream completed 
vertices for no benefit.
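
To make the distinction concrete, here is a rough sketch of the policy I have 
in mind.  It is not the Tez implementation: every class, enum, and method name 
below is hypothetical, and it assumes that output on a node removed by YARN is 
unreachable while output on a merely blacklisted node may still be served.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; none of these names come from the Tez code base.
class NodeLossPolicySketch {

    /** Why the AM stopped scheduling work on a node. */
    enum NodeEvent {
        REMOVED_BY_YARN,          // assumed: the node's shuffle output is gone
        BLACKLISTED_FOR_FAILURES  // attempts failed there; output may still be served
    }

    /** Minimal stand-in for a task that already completed on the node. */
    static final class CompletedTask {
        final String id;
        final boolean leaf; // leaf tasks have no downstream consumers

        CompletedTask(String id, boolean leaf) {
            this.id = id;
            this.leaf = leaf;
        }
    }

    /** Returns the ids of completed tasks that should be re-executed. */
    static List<String> tasksToRerun(NodeEvent event, List<CompletedTask> completedOnNode) {
        List<String> rerun = new ArrayList<>();
        if (event != NodeEvent.REMOVED_BY_YARN) {
            // Blacklisting alone re-runs nothing; if the shuffle really is
            // broken, normal fetch failure handling will surface it later.
            return rerun;
        }
        for (CompletedTask task : completedOnNode) {
            if (!task.leaf) {
                rerun.add(task.id); // downstream vertices still need this output
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        List<CompletedTask> done = Arrays.asList(
                new CompletedTask("map_1", false),
                new CompletedTask("reduce_1", true));
        System.out.println(tasksToRerun(NodeEvent.REMOVED_BY_YARN, done));
        System.out.println(tasksToRerun(NodeEvent.BLACKLISTED_FOR_FAILURES, done));
    }
}
```

With this split, a node blacklisted because of a bug in user code would trigger 
no re-execution of completed upstream tasks, which is the behavior I would have 
wanted in the case reported here.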


> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>
> Recently a user ran a job with many vertices, and there was a bug in the 
> user's code that caused a problem in one of the trailing vertices in the 
> job.  On some nodes enough tasks failed that the AM thought it needed to 
> blacklist those nodes.  That blacklisting then caused many completed vertices 
> to re-run because the AM thought it needed to re-execute the non-leaf tasks 
> that had completed on those nodes.  This wasted a lot of cluster resources 
> and job time for no benefit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
