[
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035239#comment-16035239
]
Siddharth Seth commented on TEZ-3718:
-------------------------------------
Can we do this a little differently? Typically, entities send out events, and
the receiving entity decides what to do. In this case, Nodes would
always send out an event. Container and/or TaskAttempt, based on state and
configuration, would decide what to do next.
The event from Node may still need some augmenting to indicate whether the node
was blacklisted or marked "UNHEALTHY" for some other reason.
In the current patch, I'm not sure why TaskAttempt needs to look up the
configuration. It can avoid accessing the nodeTracker node status and rely on
the event instead.
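
A minimal sketch of the event flow suggested above, assuming hypothetical names throughout: NodeBadEvent, NodeUnhealthyReason, TaskAttemptSketch and the rerunCompletedOnBadNode flag are illustrations, not classes or settings from the Tez code base or the attached patch. The node always emits the event and carries the reason on it; the attempt decides from its own state and a value resolved from configuration up front, without consulting the node tracker.

```java
// Hypothetical sketch only; none of these types exist in Tez under these names.
class NodeEventSketch {

    // Why the node was marked bad; carried on the event so receivers need no lookup.
    enum NodeUnhealthyReason { BLACKLISTED, UNHEALTHY_OTHER }

    // Event a node always sends when it turns bad.
    static final class NodeBadEvent {
        final String nodeId;
        final NodeUnhealthyReason reason;

        NodeBadEvent(String nodeId, NodeUnhealthyReason reason) {
            this.nodeId = nodeId;
            this.reason = reason;
        }
    }

    enum AttemptState { RUNNING, SUCCEEDED, KILLED }

    // Receiving entity: decides what to do from its own state and its own settings.
    static final class TaskAttemptSketch {
        AttemptState state;
        String diagnostics = "";
        // Assumed setting, resolved from configuration once at construction time.
        final boolean rerunCompletedOnBadNode;

        TaskAttemptSketch(AttemptState state, boolean rerunCompletedOnBadNode) {
            this.state = state;
            this.rerunCompletedOnBadNode = rerunCompletedOnBadNode;
        }

        void handle(NodeBadEvent event) {
            boolean kill = state == AttemptState.RUNNING
                || (state == AttemptState.SUCCEEDED && rerunCompletedOnBadNode);
            if (kill) {
                // KILLED rather than FAILED: the attempt itself did nothing wrong.
                state = AttemptState.KILLED;
                diagnostics = "Node " + event.nodeId + " marked bad: " + event.reason;
            }
        }
    }
}
```

With this shape, the node never needs to know which policy is configured, and the attempt never needs to query node status.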
> Better handling of 'bad' nodes
> ------------------------------
>
> Key: TEZ-3718
> URL: https://issues.apache.org/jira/browse/TEZ-3718
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Zhiyuan Yang
> Attachments: TEZ-3718.1.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to
> do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of
> relying on a timeout, which leads to the attempt being marked as FAILED after
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure
> heuristics. Normally source tasks require multiple consumers to report
> failure for them to be marked as bad. If a single consumer reports failure
> against a source which ran on a bad node, consider it bad and re-schedule
> immediately. (Otherwise failures can take a while to propagate, and jobs get
> a lot slower).
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to
> restart sources which ran on a bad node. Also, running tasks are being counted
> as FAILURES instead of KILLS.
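
The heuristic in item 2 of the description above can be sketched as follows. This is only an illustration under stated assumptions: the class name, method names and the normalThreshold parameter are hypothetical and not taken from Tez. A source normally needs several distinct consumers to report failure, but a single report is enough when the source ran on a node already recorded as bad.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; not the failure-handling code in Tez.
class BadNodeFailureHeuristicSketch {
    // Assumed number of distinct consumers normally required to declare a source bad.
    private final int normalThreshold;
    private final Set<String> badNodes = new HashSet<>();
    private final Map<String, Set<String>> reportersBySource = new HashMap<>();

    BadNodeFailureHeuristicSketch(int normalThreshold) {
        this.normalThreshold = normalThreshold;
    }

    // Record a node that was marked bad so later reports can take it into account.
    void markNodeBad(String nodeId) {
        badNodes.add(nodeId);
    }

    // Returns true when the source attempt should be considered bad and re-scheduled.
    boolean reportFailure(String sourceAttemptId, String sourceNodeId, String consumerId) {
        Set<String> reporters =
            reportersBySource.computeIfAbsent(sourceAttemptId, k -> new HashSet<>());
        reporters.add(consumerId); // a set, so repeated reports from one consumer count once
        int required = badNodes.contains(sourceNodeId) ? 1 : normalThreshold;
        return reporters.size() >= required;
    }
}
```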