[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003529#comment-16003529
 ] 

Jason Lowe commented on TEZ-3718:
---------------------------------

Sure, killing tasks that are active on a node that is marked bad can make 
sense, and it also makes sense to be more sensitive to rescheduling upstream 
tasks when downstream tasks start reporting failures, given we already have 
other evidence the node is bad.  I'm not a big fan of declaring all upstream 
tasks bad that ran on the node, since this often creates as many problems as 
it solves.  In some cases a node can be declared unhealthy but still be able 
to serve up shuffle data.  Unfortunately the 'bad' indication is just a 
boolean, so we don't get the fidelity required to know whether re-running all 
tasks really makes sense.
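One way to express the "be more sensitive" idea is to lower the number of 
consumer-reported fetch failures required to re-run an upstream task when the 
node it ran on has already been flagged bad.  A minimal sketch (class and 
method names here are illustrative, not actual Tez APIs):

```java
import java.util.HashSet;
import java.util.Set;

public class SourceFailureHeuristic {
    // Consumers that must report fetch failures before we normally
    // conclude the source output is lost and re-run the source task.
    private static final int NORMAL_THRESHOLD = 3;

    private final Set<String> badNodes = new HashSet<>();

    public void markNodeBad(String node) {
        badNodes.add(node);
    }

    // A single report is enough if the source ran on a node we already
    // believe is bad; otherwise require the usual quorum of consumers.
    public boolean shouldReschedule(String sourceNode, int consumerFailureReports) {
        int threshold = badNodes.contains(sourceNode) ? 1 : NORMAL_THRESHOLD;
        return consumerFailureReports >= threshold;
    }
}
```

The node-health signal acts as prior evidence, so a single fetch failure is 
treated as confirmation rather than noise.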

Not specifically related to bad node handling, but we could also improve fetch 
failure handling by taking the upstream task runtime into account when deciding 
how to handle failures.  Does it really make sense to retry fetching for 
minutes when the upstream task can regenerate the data in a few seconds?  On 
the flip side, it might make sense to try a bit harder depending upon the type 
of failure (e.g.: read timeouts for slow nodes) when we suspect it will take 
hours to complete a reschedule of a task.
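That cost/benefit trade could be captured by scaling the fetch-retry budget 
with the upstream task's observed runtime instead of using a fixed retry 
window, with extra slack for read timeouts.  A hedged sketch under assumed 
names (nothing here is an existing Tez class):

```java
public class FetchRetryPolicy {
    private static final long MIN_RETRY_MS = 5_000;    // always retry briefly
    private static final long MAX_RETRY_MS = 300_000;  // never retry forever

    // Retry for at most a fraction of the time it would take to simply
    // re-run the upstream task and regenerate its output.  Read timeouts
    // (e.g. a slow but live node) get a larger fraction, since the data
    // may still be fetchable while a re-run could be very expensive.
    public static long retryBudgetMs(long upstreamRuntimeMs, boolean readTimeout) {
        double fraction = readTimeout ? 0.5 : 0.1;
        long budget = (long) (upstreamRuntimeMs * fraction);
        return Math.max(MIN_RETRY_MS, Math.min(MAX_RETRY_MS, budget));
    }
}
```

With this shape, an upstream task that ran for a few seconds gets only the 
minimum retry window, while one that ran for an hour justifies retrying up to 
the cap.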

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>
> At the moment, the default behaviour in case of a node being marked bad is to 
> do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on 
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of 
> relying on a timeout, which leads to the attempt being marked as FAILED 
> after the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure 
> heuristics. Normally source tasks require multiple consumers to report 
> failure for them to be marked as bad. If a single consumer reports failure 
> against a source which ran on a bad node, consider it bad and re-schedule 
> immediately. (Otherwise failures can take a while to propagate, and jobs get 
> a lot slower).
> [~jlowe] - I think you've looked at this in the past. Any 
> thoughts/suggestions?
> What I'm seeing is that retroactive failures take a long time to apply and 
> to restart sources which ran on a bad node. Also, running tasks are being 
> counted as FAILURES instead of KILLS.
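The distinction behind change 1 above is that a KILL is framework-caused and 
should not count against the task's failure limit, whereas a timeout-driven 
FAILED does.  A minimal sketch of that transition, with illustrative names 
(not actual Tez state machinery):

```java
public class BadNodeHandler {
    public enum AttemptState { RUNNING, KILLED, FAILED, SUCCEEDED }

    public static class Attempt {
        public AttemptState state = AttemptState.RUNNING;
        public int failedCount;  // attempts charged against failure limits
    }

    // When a node is marked bad, proactively KILL its RUNNING attempts
    // rather than waiting for a timeout that records them as FAILED.
    // KILLED attempts do not increment failedCount.
    public static void onNodeMarkedBad(Attempt attempt) {
        if (attempt.state == AttemptState.RUNNING) {
            attempt.state = AttemptState.KILLED;
        }
    }
}
```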



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
