Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/14162
I'd be curious if you find out what was wrong with that node.
If it's the leveldb file not being created, that should be fixed by
https://github.com/apache/spark/commit/aab99d31a927adfa9216dd14e76493a187b6d6e7
which is supposed to use the approved recovery path; and if that path is bad, I
believe the NodeManager and all of its services won't come up anyway.
But ignoring the actual cause, I think if we put this in we should make it
configurable, with the default being not to throw. From a YARN point of view I
don't necessarily want one bad auxiliary service to take the entire cluster
down. For instance, let's say we have a bug in the Spark shuffle service and we
deploy it to a 5000-node cluster; this change would now cause none of the
NodeManagers to come up. But if my workload on that cluster is such that Spark
is only about 1%, I don't necessarily want that to block the other 99% of jobs
on the cluster while I try to fix the Spark shuffle handler or roll it back.
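
Just to make the configurable-default idea concrete, here's a rough sketch in
plain Java (not the actual YarnShuffleService code); the
`spark.yarn.shuffle.stopOnFailure` key and the helper names below are made up
for illustration:

```java
import java.util.Properties;

// Sketch of an opt-in fail-fast flag for an auxiliary service: only rethrow
// the init error when the flag is set, so one bad service doesn't take every
// NodeManager down by default.
public class ShuffleServiceInitSketch {

  // Hypothetical config key; default is "log and keep going".
  static final String STOP_ON_FAILURE_KEY = "spark.yarn.shuffle.stopOnFailure";

  static void serviceInit(Properties conf) {
    boolean stopOnFailure =
        Boolean.parseBoolean(conf.getProperty(STOP_ON_FAILURE_KEY, "false"));
    try {
      startShuffleServer(conf);  // stand-in for the real recovery/leveldb setup
    } catch (Exception e) {
      if (stopOnFailure) {
        // Opt-in behavior: propagate so the NodeManager refuses to come up.
        throw new RuntimeException("Failed to initialize shuffle service", e);
      } else {
        // Default behavior: log and keep the NodeManager alive for other jobs.
        System.err.println("Shuffle service failed to start, continuing: " + e);
      }
    }
  }

  // Placeholder init that always fails, just to exercise both branches.
  static void startShuffleServer(Properties conf) throws Exception {
    throw new Exception("simulated leveldb/recovery-path failure");
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    serviceInit(conf);  // default: logs the failure and continues

    conf.setProperty(STOP_ON_FAILURE_KEY, "true");
    try {
      serviceInit(conf);  // opt-in: now the failure is fatal
    } catch (RuntimeException e) {
      System.err.println("Fatal: " + e.getMessage());
    }
  }
}
```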
This should also get better once the node blacklisting work is in.