[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075194#comment-16075194
 ] 

Zhiyuan Yang commented on TEZ-3718:
-----------------------------------

Thanks [~sseth] for review! 
bq.  Instead of isUnhealthy in the event - this could be an enum (UNHEALTHY, 
BLACKLISTED)
Use enum in new patch.

bq. Don't change container state - allow a task action to change the state via 
a STOP_REQEUST depending on task level configs. OR If there are not running 
fragments on the container, change state to a COMPLETED state, so that new 
tasks allocations are not accepted.
In the third patch, I removed the 'do nothing' behavior previously introduced 
by TEZ-2972. The sole purpose of this behavior is to avoid unnecessary 
reschedule of completed task. With third patch we can do this more precisely, 
so there is no need to keep 'do nothing' behavior. Also the semantic of 
configuration is changed according. With this said, there is no issue about 
sending TA_CONTAINER_TERMINATING from container to TAImpl since running task 
should always be killed.

bq. Don't read from a Configuration instance within each AMContainer / 
TaskAttemptImpl - there's example code on how to avoid this in 
TaskImpl/TaskAttemptImpl
Didn't change this part.  Are you saying this because of locking issue? Even in 
TAImpl, configuration is read by each instance, just in a different place 
(ctor).

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to 
> do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on 
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of 
> relying on a timeout which leads to the attempt being marked as FAILED after 
> the timeout interval.
> 2. Keep track of these failed nodes, and use this as input to the failure 
> heuristics. Normally source tasks require multiple consumers to report 
> failure for them to be marked as bad. If a single consumer reports failure 
> against a source which ran on a bad node, consider it bad and re-schedule 
> immediately. (Otherwise failures can take a while to propagate, and jobs get 
> a lot slower).
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions.
> What I'm seeing is retroactive failures taking a long time to apply, and 
> restart sources which ran on a bad node. Also running tasks being counted as 
> FAILURES instead of KILLS.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to