[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005247#comment-16005247 ]

Siddharth Seth edited comment on TEZ-3718 at 5/10/17 7:15 PM:
--------------------------------------------------------------

bq. Not specifically related to bad node handling, but we could also improve 
fetch failure handling by taking the upstream task runtime into account when 
deciding how to handle failures. Does it really make sense to retry fetching 
for minutes when the upstream task can regenerate the data in a few seconds? 
I believe MapReduce may already have this factored into its retry handling.

Definitely makes sense to get something like this in as well.
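As a rough illustration of the idea, the fetcher could bound its total retry
time by a multiple of the upstream task's runtime. This is only a sketch - the
class, method, and parameter names below are made up for illustration and are
not existing Tez APIs or config keys.

{code:java}
// Hypothetical sketch: cap fetch-retry time relative to the upstream task's runtime.
public class FetchRetryPolicy {

  private final long maxRetryMillis;      // absolute upper bound, e.g. from config
  private final double runtimeMultiplier; // how many upstream runtimes to spend retrying

  public FetchRetryPolicy(long maxRetryMillis, double runtimeMultiplier) {
    this.maxRetryMillis = maxRetryMillis;
    this.runtimeMultiplier = runtimeMultiplier;
  }

  /**
   * Stop retrying once the time spent retrying exceeds roughly what simply
   * re-running the upstream task would cost.
   */
  public boolean shouldKeepRetrying(long elapsedRetryMillis, long upstreamRuntimeMillis) {
    long budget = Math.min(maxRetryMillis,
        (long) (upstreamRuntimeMillis * runtimeMultiplier));
    return elapsedRetryMillis < budget;
  }
}
{code}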

In terms of the 'bad' indication, AMNode does track failures on a node, but 
again, the underlying reason for those failures is not really known. The AM has 
a lot of information on what is going on in the cluster - transfer rate per 
node, execution rate per node, shuffle failures, etc. It should, in theory, be 
able to make much better calls. Don't think we've ever gotten around to getting 
all of this connected together in the AM though.
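If we did wire this up, a first cut could be as simple as combining those 
per-node signals into a single score. Purely illustrative - the class name, 
weights, and normalization below are assumptions, not existing Tez code.

{code:java}
// Hypothetical sketch: fold per-node signals (shuffle failures, transfer rate,
// execution rate) into a single unhealthiness score.
public class NodeHealthEstimator {

  // Illustrative weights; real values would need tuning against cluster data.
  private static final double SHUFFLE_FAILURE_WEIGHT = 0.5;
  private static final double SLOW_TRANSFER_WEIGHT = 0.3;
  private static final double SLOW_EXECUTION_WEIGHT = 0.2;

  /**
   * Returns a score in [0, 1]; higher means the node looks less healthy.
   * Inputs are assumed to be normalized against the cluster average.
   */
  public double unhealthiness(double shuffleFailureRate,
                              double relativeTransferSlowness,
                              double relativeExecutionSlowness) {
    double score = SHUFFLE_FAILURE_WEIGHT * shuffleFailureRate
        + SLOW_TRANSFER_WEIGHT * relativeTransferSlowness
        + SLOW_EXECUTION_WEIGHT * relativeExecutionSlowness;
    return Math.max(0.0, Math.min(1.0, score));
  }
}
{code}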

As long as the two suggestions on the jira don't sound absurd, I'm planning to 
use this jira for these changes.

Another question is whether to treat TIMEOUTS as FAILURES or TASK_KILLS. A 
container timeout is likely a KILL. OTOH, a task timeout, which actually makes 
use of a heartbeat every N records, is more likely to be a FAILURE.
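In other words, something along these lines - the enum and method names are 
illustrative only, not proposed APIs:

{code:java}
// Hypothetical sketch of the timeout classification described above.
enum AttemptTerminationCause { KILLED, FAILED }

enum TimeoutType { CONTAINER_TIMEOUT, TASK_HEARTBEAT_TIMEOUT }

final class TimeoutClassifier {

  /**
   * A container-level timeout says little about the task itself, so count it as
   * a KILL; a task-level heartbeat timeout (heartbeats are sent every N records)
   * suggests the task stopped making progress, so count it as a FAILURE.
   */
  static AttemptTerminationCause classify(TimeoutType type) {
    return type == TimeoutType.CONTAINER_TIMEOUT
        ? AttemptTerminationCause.KILLED
        : AttemptTerminationCause.FAILED;
  }
}
{code}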



> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>
> At the moment, the default behaviour in case of a node being marked bad is to 
> do nothing other than not schedule new tasks on this node.
> The alternative, via config, is to retroactively kill every task which ran on 
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of 
> relying on a timeout, which leads to the attempt being marked as FAILED after 
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure 
> heuristics. Normally source tasks require multiple consumers to report 
> failure for them to be marked as bad. If a single consumer reports failure 
> against a source which ran on a bad node, consider it bad and re-schedule 
> immediately. (Otherwise failures can take a while to propagate, and jobs get 
> a lot slower).
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to 
> restart sources which ran on a bad node, and running tasks being counted as 
> FAILURES instead of KILLS.
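As a rough sketch of proposal (2) in the description above - the class and 
method names are illustrative, not existing Tez code, and the normal consumer 
threshold is an assumed placeholder:

{code:java}
// Hypothetical sketch: relax the re-schedule threshold for sources that ran on a bad node.
final class SourceFailureHeuristic {

  // Number of distinct consumers that normally need to report fetch failures
  // before a source attempt is re-scheduled (illustrative default).
  private static final int NORMAL_CONSUMER_THRESHOLD = 3;

  /** Decide whether a source attempt should be re-scheduled after a reported fetch failure. */
  static boolean shouldRescheduleSource(int distinctConsumersReportingFailure,
                                        boolean sourceRanOnBadNode) {
    if (sourceRanOnBadNode) {
      // A single report is enough when the node is already known to be bad.
      return distinctConsumersReportingFailure >= 1;
    }
    return distinctConsumersReportingFailure >= NORMAL_CONSUMER_THRESHOLD;
  }
}
{code}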



