[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067111#comment-16067111 ]
Jason Lowe commented on TEZ-3718:
---------------------------------

bq. I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted differently from each other w.r.t the config which determines whether tasks need to be restarted or not.

I'm not sure I know the full history here, but unhealthy vs. blacklisted can represent different contexts. For example, a node could go unhealthy because too many disks have failed; in that case we probably want to proactively re-run upstream tasks rather than wait for the fetch failures.

Node blacklisting is a little different from unhealthy nodes. If a task runs and fails before completing, we may want to blacklist that node to prevent other tasks from also failing to complete on it. But if we have _completed_ tasks on that blacklisted node, there's a decent chance we can still complete the shuffle despite the fact that tasks are failing. For example, suppose a task needs to use a GPU, but something about the GPU setup on that node causes every task trying to use it to crash. If tasks that didn't need the GPU ran and succeeded on the node, why are we proactively re-running them rather than just fetching their inputs? That could be a huge waste of work and end up being a large performance hit to the job. TEZ-3072 was filed because of behavior like this.

It all comes down to these two questions:
- If a node is unhealthy, is it likely I won't be able to successfully shuffle data from it?
- If a node is blacklisted, is it likely I won't be able to successfully shuffle data from it?

If the answer to both of them is always the same regarding whether we re-run completed tasks, then yes, we should treat them equivalently. I think there are cases where we would want them to be different.

Full disclosure -- we've been running in a mode where we do _not_ re-run completed tasks on nodes if they go unhealthy or are blacklisted.
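The two questions above amount to independent per-state policies. A minimal sketch (purely illustrative, not actual Tez code; `RerunPolicy`, `shouldRerunCompletedTask`, and the two config flags are hypothetical names) of what answering them separately could look like:

```java
// Illustrative sketch: model the unhealthy and blacklisted cases with
// independent re-run flags so the two questions can be answered differently.
enum NodeState { HEALTHY, UNHEALTHY, BLACKLISTED }

class RerunPolicy {
    // Hypothetical config flags, one per node state.
    final boolean rerunOnUnhealthy;
    final boolean rerunOnBlacklisted;

    RerunPolicy(boolean rerunOnUnhealthy, boolean rerunOnBlacklisted) {
        this.rerunOnUnhealthy = rerunOnUnhealthy;
        this.rerunOnBlacklisted = rerunOnBlacklisted;
    }

    // Should a *completed* task on this node be proactively re-run,
    // or should we instead wait for fetch failures from its consumers?
    boolean shouldRerunCompletedTask(NodeState state) {
        switch (state) {
            case UNHEALTHY:   return rerunOnUnhealthy;   // e.g. failing disks: shuffle likely broken
            case BLACKLISTED: return rerunOnBlacklisted; // e.g. GPU-only crashes: shuffle may be fine
            default:          return false;
        }
    }
}
```

With `new RerunPolicy(true, false)` this encodes the split described above: re-run eagerly for unhealthy nodes, wait for fetch failures on blacklisted ones.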
We found many cases where the node was still able to shuffle most (sometimes all) of the completed data for its tasks despite being declared unhealthy or blacklisted. In short, re-running was causing more problems than it was fixing for us, so now we simply wait for the fetch failures. It's not always optimal, of course, and there are cases where proactively re-running would have been preferable.

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of relying on a timeout, which leads to the attempt being marked as FAILED after the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally source tasks require multiple consumers to report failure for them to be marked as bad. If a single consumer reports failure against a source which ran on a bad node, consider it bad and re-schedule immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower.)
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply, and restarting sources which ran on a bad node. Also, running tasks being counted as FAILURES instead of KILLS.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
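The failure heuristic proposed in item 2 of the quoted issue could be sketched roughly as follows (illustrative only, not actual Tez code; `FetchFailureTracker`, the method names, and the threshold value are all hypothetical):

```java
// Illustrative sketch of item 2: a source that ran on a known-bad node is
// rescheduled after a single consumer-reported fetch failure, while other
// sources still require multiple consumers to agree.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class FetchFailureTracker {
    static final int NORMAL_THRESHOLD = 3; // hypothetical multi-consumer threshold

    private final Set<String> badNodes = new HashSet<>();
    private final Map<String, Integer> failureCounts = new HashMap<>();

    void markNodeBad(String node) {
        badNodes.add(node);
    }

    // Returns true if the source task should be re-scheduled now.
    boolean reportFetchFailure(String sourceTaskId, String sourceNode) {
        int count = failureCounts.merge(sourceTaskId, 1, Integer::sum);
        // One report suffices when the source ran on a bad node; otherwise
        // wait until enough consumers have reported failure.
        return badNodes.contains(sourceNode) || count >= NORMAL_THRESHOLD;
    }
}
```

This is the propagation-speed difference the issue describes: without the bad-node shortcut, every source waits for the full threshold of consumer reports before being rescheduled.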