[ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067111#comment-16067111 ]
Jason Lowe commented on TEZ-3718:
---------------------------------

bq. I'm not sure why AMNodeImpl treats NodeUnhealthy and NodeBlacklisted differently from each other w.r.t the config which determines whether tasks need to be restarted or not.

I'm not sure I know the full history here, but unhealthy vs. blacklisted can represent different contexts. For example, a node could go unhealthy because too many disks have failed; in that case we probably want to proactively re-run upstream tasks rather than wait for the fetch failures.

Node blacklisting is a little different from unhealthy nodes. If a task runs and fails before completing, we may want to blacklist that node to prevent other tasks from also failing to complete on it. But if we have _completed_ tasks on that blacklisted node, there's a decent chance we can still complete the shuffle despite the fact that tasks are failing. For example, suppose a task needs to use a GPU, but something about the GPU setup on that node causes every task trying to use it to crash. If tasks that didn't need the GPU ran and succeeded on the node, why are we proactively re-running them rather than just fetching their inputs? That could be a huge waste of work and end up being a large performance hit to the job. TEZ-3072 was filed because of behavior like this.

It all comes down to these two questions:
- If a node is unhealthy, is it likely I won't be able to successfully shuffle data from it?
- If a node is blacklisted, is it likely I won't be able to successfully shuffle data from it?

If the answer to both of them is always the same regarding whether we re-run completed tasks, then yes, we should treat them equivalently. I think there are cases where we would want them to be different.

Full disclosure -- we've been running in a mode where we do _not_ re-run completed tasks on nodes if they go unhealthy or are blacklisted.
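The two questions above amount to independent per-state policies. A minimal sketch (purely illustrative, not actual Tez code; `RerunPolicy`, `shouldRerunCompletedTask`, and the two config flags are hypothetical names) of what answering them separately could look like:

```java
// Illustrative sketch: model the unhealthy and blacklisted cases with
// independent re-run flags so the two questions can be answered differently.
enum NodeState { HEALTHY, UNHEALTHY, BLACKLISTED }

class RerunPolicy {
    // Hypothetical config flags, one per node state.
    final boolean rerunOnUnhealthy;
    final boolean rerunOnBlacklisted;

    RerunPolicy(boolean rerunOnUnhealthy, boolean rerunOnBlacklisted) {
        this.rerunOnUnhealthy = rerunOnUnhealthy;
        this.rerunOnBlacklisted = rerunOnBlacklisted;
    }

    // Should a *completed* task on this node be proactively re-run,
    // or should we instead wait for fetch failures from its consumers?
    boolean shouldRerunCompletedTask(NodeState state) {
        switch (state) {
            case UNHEALTHY:   return rerunOnUnhealthy;   // e.g. failing disks: shuffle likely broken
            case BLACKLISTED: return rerunOnBlacklisted; // e.g. GPU-only crashes: shuffle may be fine
            default:          return false;
        }
    }
}
```

With `new RerunPolicy(true, false)` this encodes the split described above: re-run eagerly for unhealthy nodes, wait for fetch failures on blacklisted ones.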
We found many cases where the node was still able to shuffle most (sometimes all) of the completed data for its tasks despite being declared unhealthy or blacklisted. In short, re-running was causing more problems than it was fixing for us, so now we simply wait for the fetch failures. It's not always optimal, of course, and there are cases where proactively re-running would have been preferable.

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of relying on a timeout, which leads to the attempt being marked as FAILED after the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally source tasks require multiple consumers to report failure for them to be marked as bad. If a single consumer reports failure against a source which ran on a bad node, consider it bad and re-schedule immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower.)
> [~jlowe] - think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply, and restarting sources which ran on a bad node. Also, running tasks being counted as FAILURES instead of KILLS.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
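The failure heuristic proposed in item 2 of the quoted issue could be sketched roughly as follows (illustrative only, not actual Tez code; `FetchFailureTracker`, the method names, and the threshold value are all hypothetical):

```java
// Illustrative sketch of item 2: a source that ran on a known-bad node is
// rescheduled after a single consumer-reported fetch failure, while other
// sources still require multiple consumers to agree.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class FetchFailureTracker {
    static final int NORMAL_THRESHOLD = 3; // hypothetical multi-consumer threshold

    private final Set<String> badNodes = new HashSet<>();
    private final Map<String, Integer> failureCounts = new HashMap<>();

    void markNodeBad(String node) {
        badNodes.add(node);
    }

    // Returns true if the source task should be re-scheduled now.
    boolean reportFetchFailure(String sourceTaskId, String sourceNode) {
        int count = failureCounts.merge(sourceTaskId, 1, Integer::sum);
        // One report suffices when the source ran on a bad node; otherwise
        // wait until enough consumers have reported failure.
        return badNodes.contains(sourceNode) || count >= NORMAL_THRESHOLD;
    }
}
```

This is the propagation-speed difference the issue describes: without the bad-node shortcut, every source waits for the full threshold of consumer reports before being rescheduled.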