Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/3541#issuecomment-65309912
Note: I am ignoring deterministic failure reasons here (which will fail on
any host and usually point to a bug in user code or the Spark codebase).
Task failures can have a variety of transient causes, which could be
directly related to the task in question, indirectly related to it, or even
completely unrelated to it.
For example:
- What other tasks are running on the executor and how they interact with
the failed task.
- What data is currently cached on the executor and the impact it has on
resource utilization (RDDs, broadcasts, buffers, GC, etc.).
- What the current state of the executor is (e.g. in the process of shutting
down, but has not yet informed the driver about it).
... among others.
Also note that when resource constraints are enforced (particularly on
YARN, where memory limits are aggressively enforced), one or more of the
above can interact with them to cause further non-deterministic failures,
which is why we have hacks like memory overheads to help alleviate (though
not eliminate) them.
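
To make the memory overhead point concrete, here is a rough sketch of the arithmetic involved. The function names, the overhead factor and the minimum overhead are all illustrative assumptions (they are not taken from this thread or from any particular Spark release); the only point is that the container limit is the executor heap plus some padding, and anything beyond that gets the container killed.

```python
def container_request_mb(executor_memory_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Size of the container requested for one executor, in MB.

    overhead_factor and min_overhead_mb are hypothetical defaults chosen
    for illustration only.
    """
    overhead_mb = max(int(executor_memory_mb * overhead_factor), min_overhead_mb)
    return executor_memory_mb + overhead_mb


def exceeds_limit(observed_usage_mb, executor_memory_mb, **overhead_kwargs):
    """True if the process footprint is above the container limit.

    The footprint includes off-heap usage (buffers, native memory, etc.),
    which is why it can exceed the configured heap size.
    """
    return observed_usage_mb > container_request_mb(executor_memory_mb, **overhead_kwargs)


if __name__ == "__main__":
    # A 4 GB heap with the assumed padding.
    print(container_request_mb(4096))
    # The same executor using 4.6 GB in total would be over the limit.
    print(exceeds_limit(4700, 4096))
```

Raising the padding makes such kills less likely, but it cannot rule them out, since the extra usage depends on what else is running and cached on the executor at that moment.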
Since we have limits on the number of times an application can have executor
failures, the number of times a task can fail before the application is failed,
etc., we need an executor-level blacklist.
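
As a rough illustration of what I mean by an executor-level blacklist, here is a minimal sketch. It is not Spark's scheduler code: the class name, method names and the timeout-based expiry are all assumptions made for the example. The idea is just that a task which failed on an executor is not offered to that same executor again for a while, so a retry does not burn another attempt in the same place.

```python
import time


class ExecutorBlacklist:
    """Hypothetical per-(task, executor) blacklist with a timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the expiry can be tested
        self._failed_at = {}  # (task_id, executor_id) -> time of last failure

    def record_failure(self, task_id, executor_id):
        self._failed_at[(task_id, executor_id)] = self.clock()

    def is_blacklisted(self, task_id, executor_id):
        key = (task_id, executor_id)
        failed = self._failed_at.get(key)
        if failed is None:
            return False
        if self.clock() - failed >= self.timeout_s:
            # The failure is old enough: treat it as transient and forget it.
            del self._failed_at[key]
            return False
        return True

    def eligible(self, task_id, executor_ids):
        """Executors this task may still be scheduled on, in input order."""
        return [e for e in executor_ids if not self.is_blacklisted(task_id, e)]


if __name__ == "__main__":
    blacklist = ExecutorBlacklist(timeout_s=30.0)
    blacklist.record_failure("task-7", "exec-1")
    print(blacklist.eligible("task-7", ["exec-1", "exec-2"]))
    print(blacklist.eligible("task-8", ["exec-1", "exec-2"]))
```

Note that the entry is keyed on the task as well as the executor: other tasks can still run on the executor, which matters when the failure was caused by something transient.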
Note, this does not mean we do not need a host-level blacklist! I can
definitely see value in that if the issues above are host level: as pointed
out, lack of HDD space, bad memory or CPU, thermal issues, etc.
Ideally, as I mentioned in the past, we need a better way to identify and
blacklist executors/hosts/racks.
What we currently have is a stopgap hack, and upgrading it from executor
level to host level does not solve these problems (it actually causes
regressions in our workloads, since we are now missing a replica completely for
DFS data and the other replicas might not be on our allocated hosts/executors).
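
A toy example of the locality regression, under made-up assumptions: the host names, executor names and the helper function are all hypothetical. A block has replicas on three hosts, but only one of those hosts (h1) is among the hosts allocated to the application, and it runs two executors.

```python
def local_candidates(replica_hosts, executor_hosts, failed_executor, level):
    """Executors that can still read the block locally after one failure.

    replica_hosts:   set of hosts holding a replica of the block
    executor_hosts:  dict of executor id -> host it runs on
    failed_executor: executor on which the task just failed
    level:           "executor" or "host" blacklist granularity
    """
    if level not in ("executor", "host"):
        raise ValueError("level must be 'executor' or 'host'")
    failed_host = executor_hosts[failed_executor]
    candidates = []
    for executor, host in sorted(executor_hosts.items()):
        if host not in replica_hosts:
            continue  # no local replica on this host
        if level == "executor" and executor == failed_executor:
            continue
        if level == "host" and host == failed_host:
            continue
        candidates.append(executor)
    return candidates


if __name__ == "__main__":
    replicas = {"h1", "h7", "h9"}  # h7 and h9 are not allocated to us
    executors = {"e1": "h1", "e2": "h1", "e3": "h2"}
    print(local_candidates(replicas, executors, "e1", "executor"))
    print(local_candidates(replicas, executors, "e1", "host"))
```

With the executor-level blacklist the task can still be retried on e2 with a local read; with the host-level blacklist no executor with a local replica remains, so the retry has to go to e3 and read the data remotely.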