Github user squito commented on the issue:
https://github.com/apache/spark/pull/17088
@kayousterhout I don't think https://github.com/apache/spark/pull/14931 is
really a complete answer to this.
(a) we only get that in standalone mode, not from any other cluster manager
(YARN does not notify applications when a node in the cluster fails)
(b) even in standalone mode, the app is [only notified if it has an active
executor on the
node](https://demo.fluentcode.com/source/spark/master/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala?squery=Telling%20app%20of%20#L779).
However, with dynamic allocation, a node may still be serving shuffle files
even though there are no active executors there anymore.
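To make (b) concrete, here is a minimal sketch of the setup where this arises
(the config keys are standard Spark settings; the app name is just a
placeholder): with dynamic allocation plus the external shuffle service, idle
executors are released, but their shuffle output stays behind on the node.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Executors are released when idle, but their shuffle output stays on
// the node and is served by the external shuffle service -- so a node
// can still matter to the app while hosting zero active executors.
val conf = new SparkConf()
  .setAppName("dyn-alloc-shuffle-sketch") // placeholder name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  // executors idle past this timeout are removed, orphaning their
  // shuffle files on the node
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

val sc = new SparkContext(conf)
```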
(c) the executors on the node may appear responsive even though shuffle files
can't be served anyway -- maybe a disk has gone bad. Executor blacklisting may
eventually discover this, but it might not if the tasks don't write to disk
(so tasks keep executing successfully on the source of the fetch failures);
and even if the tasks did write to disk, it would take a while for the
blacklisting to kick in, and you would still hit the scenario originally
described.
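For a sense of why (c) takes a while, here is a hedged sketch of the Spark 2.x
blacklist thresholds involved (the values shown are the defaults; this is
illustration, not a tuning recommendation). Blacklisting only counts *task*
failures on a node, so if the tasks there keep succeeding, none of these
counters ever fire for it.

```scala
import org.apache.spark.SparkConf

// Blacklisting reacts to accumulated task failures, not to fetch
// failures observed elsewhere -- a node whose tasks succeed is never
// blacklisted, even if reducers can't fetch its shuffle files.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // attempts of one task that may fail on an executor / node before
  // that executor / node is blacklisted for the task
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
  // distinct failed tasks / executors before stage-level blacklisting
  .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")
```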
(Aside: (c) made me think more about whether we should be removing shuffle
data when we blacklist, both for executors and nodes ... I think the behavior
will be correct either way, but there are similar tradeoffs about which
situation to optimize for.)
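As a rough sketch of that aside -- the tracker trait and the wiring below are
hypothetical stand-ins, not this PR's actual code -- removing shuffle data on
blacklisting could look like:

```scala
// Stand-in for the internal tracker of registered map outputs; the
// removeOutputsOnHost operation is the behavior under discussion.
trait ShuffleOutputTracker {
  def removeOutputsOnHost(host: String): Unit
}

// Hypothetical wiring: when blacklisting a node, also drop the shuffle
// outputs registered to it, so downstream stages recompute them up
// front instead of discovering the loss through fetch failures.
def onNodeBlacklisted(tracker: ShuffleOutputTracker, host: String): Unit =
  tracker.removeOutputsOnHost(host)
```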
I think that https://github.com/apache/spark/pull/14931 is just a small
optimization when possible, not a mechanism that can be relied upon.