[
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075194#comment-16075194
]
Zhiyuan Yang edited comment on TEZ-3718 at 7/5/17 8:40 PM:
-----------------------------------------------------------
Thanks [~sseth] for the review!
bq. Instead of isUnhealthy in the event - this could be an enum (UNHEALTHY,
BLACKLISTED)
Used an enum in the new patch.
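Roughly, the shape of that change (class and field names here are a hypothetical sketch, not the exact patch):
{code:java}
// Hypothetical sketch: the event carries an enum describing why the node
// became unusable, instead of a bare isUnhealthy boolean.
public class NodeUnusableEvent {

  public enum NodeUnusableReason {
    UNHEALTHY,   // node reported unhealthy by the cluster
    BLACKLISTED  // node blacklisted by the AM after repeated task failures
  }

  private final NodeUnusableReason reason;

  public NodeUnusableEvent(NodeUnusableReason reason) {
    this.reason = reason;
  }

  public NodeUnusableReason getReason() {
    return reason;
  }
}
{code}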
bq. Don't change container state - allow a task action to change the state via
a STOP_REQUEST depending on task level configs. OR If there are no running
fragments on the container, change state to a COMPLETED state, so that new
task allocations are not accepted.
In the third patch, I removed the 'do nothing' behavior previously introduced
by TEZ-2972. The sole purpose of that behavior was to avoid unnecessary
rescheduling of completed tasks. With the third patch we can do this more
precisely, so there is no need to keep the 'do nothing' behavior, and the
semantics of the configuration have changed accordingly. With that said, there
is no issue with sending TA_CONTAINER_TERMINATING from the container to
TAImpl, since running tasks should always be killed. This is also what we do
in other error-handling transitions, like LaunchFailedTransition.
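To illustrate the pattern (names here are hypothetical; {{SingleArcTransition}} is the Hadoop state-machine interface Tez already uses, and {{sendTerminatingToTaskAttempt}} stands in for however the container routes the event to the attempt):
{code:java}
// Hypothetical sketch of a node-failure transition, mirroring existing
// error-handling transitions such as LaunchFailedTransition: notify the
// running attempt with TA_CONTAINER_TERMINATING so it is KILLED promptly
// instead of timing out and being counted as FAILED.
protected static class NodeFailedWhileRunningTransition
    implements SingleArcTransition<AMContainerImpl, AMContainerEvent> {
  @Override
  public void transition(AMContainerImpl container, AMContainerEvent cEvent) {
    if (container.runningAttempt != null) {
      container.sendTerminatingToTaskAttempt(container.runningAttempt,
          "Node failed while attempt was running",
          TaskAttemptTerminationCause.NODE_FAILED);
    }
  }
}
{code}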
bq. Don't read from a Configuration instance within each AMContainer /
TaskAttemptImpl - there's example code on how to avoid this in
TaskImpl/TaskAttemptImpl
Didn't change this part. Are you saying this because of a locking issue? Even
in TAImpl, the configuration is read by each instance, just in a different
place (the ctor).
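For reference, the per-instance pattern I mean, i.e. read the Configuration once in the ctor and cache the value in a field instead of reading on every event (class name and usage are hypothetical; the key is the one from TEZ-2972):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch of the caching pattern: one Configuration read per
// instance, done in the constructor, as TaskAttemptImpl already does.
public class NodeFailureHandler {

  private final boolean rescheduleTasksOnUnhealthyNode;

  public NodeFailureHandler(Configuration conf) {
    // Read once here instead of calling conf.getBoolean() on every event.
    this.rescheduleTasksOnUnhealthyNode = conf.getBoolean(
        "tez.am.node-unhealthy-reschedule-tasks", false);
  }

  public boolean shouldReschedule() {
    return rescheduleTasksOnUnhealthyNode;
  }
}
{code}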
> Better handling of 'bad' nodes
> ------------------------------
>
> Key: TEZ-3718
> URL: https://issues.apache.org/jira/browse/TEZ-3718
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Zhiyuan Yang
> Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to
> do nothing other than not schedule new tasks on this node.
> The alternative, via config, is to retroactively kill every task which ran on
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of
> relying on a timeout, which leads to the attempt being marked as FAILED after
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure
> heuristics. Normally source tasks require multiple consumers to report
> failure for them to be marked as bad. If a single consumer reports failure
> against a source which ran on a bad node, consider it bad and re-schedule it
> immediately, as in the sketch below. (Otherwise failures can take a while to
> propagate, and jobs get a lot slower.)
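> A minimal sketch of this heuristic (names like {{badNodes}} are illustrative,
> not the actual implementation):
> {code:java}
> // Hypothetical sketch: one consumer-reported failure is enough to
> // re-schedule a source attempt that ran on a known-bad node; otherwise
> // the usual multi-consumer threshold applies.
> boolean shouldRescheduleSource(NodeId sourceNode, int consumerFailureReports,
>     Set<NodeId> badNodes, int maxAllowedConsumerFailures) {
>   if (badNodes.contains(sourceNode)) {
>     return consumerFailureReports >= 1;
>   }
>   return consumerFailureReports >= maxAllowedConsumerFailures;
> }
> {code}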
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to
> restart sources which ran on a bad node, and running tasks being counted as
> FAILURES instead of KILLS.