[ 
https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090848#comment-16090848
 ] 

Siddharth Seth commented on TEZ-3718:
-------------------------------------

The patch mostly looks good to me. More on potential changes at the end.
- TaskAttemptEventNodeFailed - the failure reason is still a boolean. This can
be an enum as well.
- Config is still being read in TAImpl. To your point about it already being
read there: that needs to be changed. I'm trying to make sure nothing deeper
than the Vertex level accesses it. (Configuration has historically been slow
to access.) This should be a simple change via getVertex.getVertexConfig.
- Changing the config parameter node-unhealthy-reschedule-tasks is an
incompatible change. The existing parameter should be deprecated, and a new
one introduced.
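
To illustrate the first and third points, here is a minimal sketch. It is not
taken from the patch: NodeFailureReason, NodeFailedEvent, RescheduleSetting and
the new key name are all hypothetical, and a plain Map stands in for the real
configuration object. The enum replaces the boolean on the event, and the
resolver prefers the new key while still honouring the deprecated one.

```java
import java.util.Map;

/** Hypothetical failure reasons; the real set would come from the patch. */
enum NodeFailureReason {
    NODE_UNHEALTHY,
    NODE_BLACKLISTED
}

/** Hypothetical stand-in for an event that carries an enum, not a boolean. */
final class NodeFailedEvent {
    private final NodeFailureReason reason;

    NodeFailedEvent(NodeFailureReason reason) {
        this.reason = reason;
    }

    NodeFailureReason getReason() {
        return reason;
    }
}

/** Resolves the reschedule setting without breaking users of the old key. */
final class RescheduleSetting {
    // Key name as written in the comment above (any prefix omitted).
    static final String OLD_KEY = "node-unhealthy-reschedule-tasks";
    // Hypothetical replacement key, for illustration only.
    static final String NEW_KEY = "node-unhealthy-reschedule-completed-tasks";

    private RescheduleSetting() {
    }

    static boolean resolve(Map<String, String> conf, boolean defaultValue) {
        // The new key wins when both are set.
        String value = conf.get(NEW_KEY);
        if (value == null) {
            // Fall back to the deprecated key so existing setups keep working.
            value = conf.get(OLD_KEY);
        }
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }
}
```

The value would be resolved once at the Vertex level and passed down, rather
than being looked up again per task attempt.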

Filed TEZ-3799 to make blacklisting behave the same.

The patch does change behaviour: the task currently running on the container
is killed irrespective of the config setting. (With the previous setting, not
only would old tasks not be re-run, the current one would not be terminated
either.)
[~jlowe], [~rohini] - based on the offline conversation we had about this, the
preference was to have this configurable.
With the current patch, making this configurable would be a big change to the
Container state machine. It's wired to complete in case of a node failure
(which is correct IMHO), and if it completes, the running task will end up
completing as well.
One possible way to handle this: retain the old behaviour (AMNode will not
send out events), which can be covered by the current config. If the old
behaviour is enabled, it continues as before, and the changes to Task are
irrelevant (the new configs don't apply). With the old config set to send
Container termination messages, the new flags can kick in. The running task
will be killed. For completed ones, fast exits can be enabled in case of Input
errors.
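
A rough sketch of that gating, as I read the proposal. The class, enum and
flag names here are hypothetical and are not part of the patch. The existing
config decides whether any node-failure events are sent at all, and the new
flag only matters once they are.

```java
/** Hypothetical, simplified attempt states for this sketch. */
enum AttemptState { RUNNING, SUCCEEDED }

/** Hypothetical outcomes of a node-failure notification. */
enum NodeFailureAction { NONE, KILL_ATTEMPT, ENABLE_FAST_EXIT }

final class NodeFailurePolicy {
    // Existing config: whether Container termination messages are sent.
    private final boolean sendContainerTermination;
    // Hypothetical new flag: fast exit for completed attempts on Input errors.
    private final boolean fastExitOnInputError;

    NodeFailurePolicy(boolean sendContainerTermination,
                      boolean fastExitOnInputError) {
        this.sendContainerTermination = sendContainerTermination;
        this.fastExitOnInputError = fastExitOnInputError;
    }

    NodeFailureAction onNodeFailed(AttemptState state) {
        if (!sendContainerTermination) {
            // Old behaviour: no events are sent, so the new flag is ignored.
            return NodeFailureAction.NONE;
        }
        switch (state) {
            case RUNNING:
                // The running task is killed rather than left to time out.
                return NodeFailureAction.KILL_ATTEMPT;
            case SUCCEEDED:
                return fastExitOnInputError
                        ? NodeFailureAction.ENABLE_FAST_EXIT
                        : NodeFailureAction.NONE;
            default:
                return NodeFailureAction.NONE;
        }
    }
}
```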

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to 
> do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on 
> the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of 
> relying on a timeout which leads to the attempt being marked as FAILED after 
> the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure 
> heuristics. Normally source tasks require multiple consumers to report 
> failure for them to be marked as bad. If a single consumer reports failure 
> against a source which ran on a bad node, consider it bad and re-schedule 
> immediately. (Otherwise failures can take a while to propagate, and jobs get 
> a lot slower).
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to 
> restart sources which ran on a bad node. Also, running tasks are being 
> counted as FAILURES instead of KILLS.
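
The heuristic in item 2 of the description could be sketched roughly as below.
This is only an illustration: SourceFailureTracker and the report threshold
are hypothetical, and one tracker instance is assumed per source attempt.
Normally several distinct consumers must report a failure before the source
is re-scheduled, but a single report is enough if the source ran on a node
already known to be bad.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-source-attempt tracker of consumer failure reports. */
final class SourceFailureTracker {
    // Hypothetical threshold of distinct consumers for the normal case.
    private final int requiredReports;
    private final Set<String> badNodes = new HashSet<>();
    private final Set<String> reporters = new HashSet<>();

    SourceFailureTracker(int requiredReports) {
        this.requiredReports = requiredReports;
    }

    /** Records that a node has been marked bad. */
    void markNodeBad(String nodeId) {
        badNodes.add(nodeId);
    }

    /** Returns true if the source attempt should be re-scheduled now. */
    boolean reportInputFailure(String consumerId, String sourceNodeId) {
        // A set is used so repeated reports from one consumer count once.
        reporters.add(consumerId);
        if (badNodes.contains(sourceNodeId)) {
            // Source ran on a bad node: one report is enough.
            return true;
        }
        return reporters.size() >= requiredReports;
    }
}
```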



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
