Github user squito commented on the pull request:
https://github.com/apache/spark/pull/8760#issuecomment-142683197
Hi @mwws, I've been through it in more detail; sorry if my comments have been
a bit scattered as I worked through understanding this. I have one more
high-level thought about the approach:
It's great that we can update YARN on the blacklisted nodes. However, do
the strategies also need some way of *immediately* killing the blacklisted
executors? E.g., say you request 5 executors on a 100-node cluster, and in the
early phase of your app you discover that one of the executors is failing
repeatedly. The proposed change lets us schedule tasks on just the
4 good executors, but wouldn't you rather tell YARN you've got a bad executor,
kill it, and request another one? This could be left to the strategy to
choose, but it seems like it should be possible, and included in one of the
provided implementations. In that case it wouldn't really make sense to have
an expiry time for the executors, since you just kill them outright, but I
suppose you would still have an expiry time for the node blacklist. That also
gives you a natural escalation: you start by assuming you've got one bad
executor, and then if you get another bad executor on the same node, you
blacklist the entire node.
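To make the escalation idea concrete, here is a minimal sketch of what I have in mind. The trait and class names (`BlacklistStrategy`, `EscalatingStrategy`, and their methods) are purely illustrative and are not the interfaces in this PR; it just shows one way a strategy could mark a bad executor for immediate killing while keeping an expiring blacklist only for nodes:

```scala
import scala.collection.mutable

// Hypothetical interface for the escalation idea discussed above.
trait BlacklistStrategy {
  /** Record a repeated task failure on an executor running on the given host. */
  def onExecutorFailure(executorId: String, host: String): Unit
  /** Executors the scheduler should ask the cluster manager to kill right now. */
  def executorsToKill(): Set[String]
  /** Hosts that should currently be excluded, e.g. reported to YARN. */
  def blacklistedNodes(now: Long): Set[String]
}

class EscalatingStrategy(nodeExpiryMs: Long) extends BlacklistStrategy {
  private val badExecutorsByHost = mutable.Map.empty[String, mutable.Set[String]]
  private val nodeBlacklistedAt = mutable.Map.empty[String, Long]
  private val pendingKills = mutable.Set.empty[String]

  override def onExecutorFailure(executorId: String, host: String): Unit = {
    val bad = badExecutorsByHost.getOrElseUpdate(host, mutable.Set.empty)
    bad += executorId
    // Kill the bad executor immediately rather than just avoiding it.
    pendingKills += executorId
    // Second bad executor on the same host: escalate to blacklisting the node.
    if (bad.size >= 2) {
      nodeBlacklistedAt(host) = System.currentTimeMillis()
    }
  }

  override def executorsToKill(): Set[String] = {
    val toKill = pendingKills.toSet
    pendingKills.clear() // killed executors need no expiry time
    toKill
  }

  override def blacklistedNodes(now: Long): Set[String] = {
    // The node blacklist still expires, so a host can come back later.
    nodeBlacklistedAt.retain { case (_, t) => now - t < nodeExpiryMs }
    nodeBlacklistedAt.keySet.toSet
  }
}
```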