Chaoran Yu created YUNIKORN-588:
-----------------------------------
Summary: Placeholder pods are not cleaned up timely when the Spark driver fails
Key: YUNIKORN-588
URL: https://issues.apache.org/jira/browse/YUNIKORN-588
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Affects Versions: 0.10
Reporter: Chaoran Yu
When a Spark job is gang scheduled and the driver pod fails immediately after
starting (e.g. due to an error in the Spark application code), the placeholder
pods still try to reserve resources. They are not terminated until the
configured timeout has passed, even though they should be cleaned up the
moment the driver fails: at that point we already know that none of the
executors will have a chance to start.
A fix probably belongs at the Spark operator plugin level: it should trigger
placeholder cleanup as soon as the driver fails, so that the reserved
resources are released sooner.
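
To make the proposed behaviour concrete, here is a minimal sketch in Go of the
decision such a cleanup hook might make. It is not YuniKorn or Spark operator
code: the Pod type, the PodPhase constants and the placeholdersToRelease
function are all hypothetical stand-ins defined here for illustration, and the
sketch assumes that placeholders can be matched to the driver by a shared
application ID.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase strings. It is redefined locally
// so that this sketch has no external dependencies.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name        string
	AppID       string // assumed: driver and placeholders share an application ID
	Placeholder bool
	Phase       PodPhase
}

// placeholdersToRelease returns the names of the placeholder pods that could
// be deleted right away instead of waiting for the placeholder timeout.
// It returns nil unless the driver has failed.
func placeholdersToRelease(driver Pod, pods []Pod) []string {
	if driver.Phase != PodFailed {
		return nil
	}
	var names []string
	for _, p := range pods {
		// Only placeholders that belong to the failed driver's application.
		if !p.Placeholder || p.AppID != driver.AppID {
			continue
		}
		// Only placeholders that are still holding or requesting resources.
		if p.Phase == PodPending || p.Phase == PodRunning {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	driver := Pod{Name: "spark-driver", AppID: "app-1", Phase: PodFailed}
	pods := []Pod{
		{Name: "ph-1", AppID: "app-1", Placeholder: true, Phase: PodRunning},
		{Name: "ph-2", AppID: "app-1", Placeholder: true, Phase: PodPending},
		{Name: "ph-3", AppID: "app-2", Placeholder: true, Phase: PodRunning},
	}
	fmt.Println(placeholdersToRelease(driver, pods))
}
```

In this sketch only the placeholders of the failed driver's application are
selected; placeholders of other applications are left alone. How the driver
failure is actually observed and how the placeholders are deleted is left
open here.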