[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065786#comment-16065786 ]

Siddharth Seth commented on TEZ-3718:
-------------------------------------

I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted 
differently from each other w.r.t. the config which determines whether tasks 
need to be restarted or not. I think both can be treated the same. [~jlowe] - 
you may have more context on this.

The changes to the Node-related classes mostly look good to me. Instead of 
isUnhealthy in the event, this could be an enum (UNHEALTHY, BLACKLISTED) - 
rough sketch below.
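
Something like the following is what I have in mind - all names here are made 
up for illustration, not what's in the patch:

{code:java}
// Hypothetical sketch: carry the cause of the node going bad as an enum
// instead of a boolean isUnhealthy flag. Class/field names are illustrative.
public class NodeFailedEvent {
  public enum Cause { UNHEALTHY, BLACKLISTED }

  private final String nodeId; // stand-in for the real NodeId type
  private final Cause cause;

  public NodeFailedEvent(String nodeId, Cause cause) {
    this.nodeId = nodeId;
    this.cause = cause;
  }

  public String getNodeId() { return nodeId; }
  public Cause getCause() { return cause; }
}
{code}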

For AMContainer, I'm not sure why the fail-task config needs to be read. Will 
the following work (sketch after the list)?
- When event received, annotate the container to say "On A Failed Node" 
(Already done)
- Inform prior and current attempts of the node failure.
- Don't change container state - allow a task action to change the state via a 
STOP_REQUEST depending on task-level configs.
  OR
  If there are no running fragments on the container, change state to a 
COMPLETED state, so that new task allocations are not accepted.
  Do not accept new tasks since nodeFailure has been set.
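
Rough sketch of the container-side handling above, reusing the hypothetical 
NodeFailedEvent from the earlier snippet; the real AMContainerImpl state 
machine would of course look different:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the container-side handling described above; all names are
// hypothetical and the actual AMContainerImpl transitions would differ.
public class ContainerNodeFailureSketch {
  private boolean onFailedNode = false;        // the "On A Failed Node" annotation
  private final List<String> priorAttempts = new ArrayList<>();
  private String currentAttempt;               // null when nothing is running

  interface AttemptNotifier {
    void nodeFailed(String attemptId, NodeFailedEvent.Cause cause);
  }

  void handleNodeFailed(NodeFailedEvent event, AttemptNotifier notifier) {
    onFailedNode = true;                       // annotate the container
    for (String attempt : priorAttempts) {     // inform prior attempts
      notifier.nodeFailed(attempt, event.getCause());
    }
    if (currentAttempt != null) {              // inform the current attempt
      notifier.nodeFailed(currentAttempt, event.getCause());
      // leave container state alone here; a STOP_REQUEST driven by the
      // task-level configs decides what happens next
    }
    // else: no running fragments - moving to a COMPLETED-like state would
    // equally stop new allocations from being accepted
  }

  boolean acceptAllocation(String attemptId) {
    if (onFailedNode) {
      return false;                            // nodeFailure set - refuse new tasks
    }
    currentAttempt = attemptId;
    return true;
  }
}
{code}
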
TaskAttempt
- From a brief glance, the functionality looks good: Fail_Fast / decide whether 
to keep a task or cause it to be killed on a node failure (sketch below).
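
The attempt-level decision should boil down to something like this 
(hypothetical helper, not the patch's code):

{code:java}
// Hypothetical helper showing the attempt-level decision on a node failure;
// running attempts should end up KILLED (not FAILED) so they don't count
// against failure limits.
final class AttemptNodeFailurePolicy {
  enum Action { KEEP, KILL_AND_RESCHEDULE }

  static Action decide(boolean attemptIsRunning,
                       boolean killRunning,
                       boolean rescheduleCompleted) {
    if (attemptIsRunning) {
      return killRunning ? Action.KILL_AND_RESCHEDULE : Action.KEEP;
    }
    // completed attempts are rerun retroactively only when the reschedule
    // config is on
    return rescheduleCompleted ? Action.KILL_AND_RESCHEDULE : Action.KEEP;
  }
}
{code}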

General
- Don't read from a Configuration instance within each AMContainer / 
TaskAttemptImpl - there's example code on how to avoid this in 
TaskImpl/TaskAttemptImpl (sketch at the end of this section).
- Thought the configs would be the following:
TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS=false - Current, default=false
TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING=true - New, default=true (overrides 
TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS)
The third config in the patch looks good.
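
Concretely, something along these lines - resolve the values once and hand 
plain booleans around. The kill-running key/name is made up here; only the 
reschedule key exists in TezConfiguration today:

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: the kill-running key is the proposed new config and its name
// is a guess. Resolve the values once here and pass plain booleans to
// AMContainer / TaskAttemptImpl instead of letting each instance re-read
// the Configuration.
public final class NodeFailureConfig {
  static final String TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS =
      "tez.am.node-unhealthy-reschedule-tasks";  // current, default=false
  static final String TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING =
      "tez.am.node-unhealthy-kill-running";      // proposed, default=true

  final boolean rescheduleTasks;  // retroactively rerun completed attempts
  final boolean killRunning;      // kill attempts still running on the bad
                                  // node; overrides rescheduleTasks for them

  NodeFailureConfig(Configuration conf) {
    this.rescheduleTasks =
        conf.getBoolean(TEZ_AM_NODE_UNHEALTHY_RESCHEDULE_TASKS, false);
    this.killRunning =
        conf.getBoolean(TEZ_AM_NODE_UNHEALTHY_KILL_RUNNING, true);
  }
}
{code}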

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to 
> do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on 
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of 
> relying on a timeout which leads to the attempt being marked as FAILED after 
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure 
> heuristics. Normally source tasks require multiple consumers to report 
> failure for them to be marked as bad. If a single consumer reports failure 
> against a source which ran on a bad node, consider it bad and re-schedule 
> immediately. (Otherwise failures can take a while to propagate, and jobs get 
> a lot slower).
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to 
> restart sources which ran on a bad node. Also, running tasks are being 
> counted as FAILURES instead of KILLS.


