[
https://issues.apache.org/jira/browse/YUNIKORN-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kinga Marton reassigned YUNIKORN-588:
-------------------------------------
Assignee: (was: Kinga Marton)
> Placeholder pods are not cleaned up timely when the Spark driver fails
> ----------------------------------------------------------------------
>
> Key: YUNIKORN-588
> URL: https://issues.apache.org/jira/browse/YUNIKORN-588
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Affects Versions: 0.10
> Reporter: Chaoran Yu
> Priority: Major
> Labels: spark
> Attachments: Screen Shot 2021-03-19 at 9.41.48 PM.png
>
>
> When a Spark job is gang scheduled, if the driver pod fails immediately upon
> running (e.g. due to an error in the Spark application code), the placeholder
> pods will continue to reserve resources. They won't be terminated until the
> configured timeout has passed, even though they should have been cleaned up
> the moment the driver failed: at that point we already know that none of the
> executors will ever have a chance to start.
> Something probably needs to be done at the Spark operator plugin level to
> trigger placeholder cleanup and release the resources sooner.
> Edit: Actually, the fix needs to work without the Spark operator plugin,
> because the user might not be using it; the Spark job could well have been
> submitted via spark-submit.
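The early-release behaviour described above can be sketched as follows. This is a minimal, self-contained Go sketch of the decision logic only; the `Placeholder` type and `releasePlaceholdersOnDriverFailure` function are illustrative names, not the actual shim API, and in the real shim "releasing" would mean deleting the placeholder pods via the Kubernetes API rather than flipping a flag.

```go
package main

import "fmt"

// PodPhase mirrors the Kubernetes pod phase values relevant here.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// Placeholder represents a gang-scheduling placeholder pod that is
// reserving resources on behalf of a not-yet-created executor.
type Placeholder struct {
	Name     string
	Released bool
}

// releasePlaceholdersOnDriverFailure frees all placeholders as soon as the
// driver pod reaches the Failed phase, instead of waiting for the configured
// placeholder timeout. It returns the number of placeholders released.
func releasePlaceholdersOnDriverFailure(driverPhase PodPhase, placeholders []*Placeholder) int {
	if driverPhase != PodFailed {
		// Driver is still pending or running: the normal timeout
		// handling continues to apply, nothing to release early.
		return 0
	}
	released := 0
	for _, ph := range placeholders {
		if !ph.Released {
			ph.Released = true // real shim: delete the placeholder pod
			released++
		}
	}
	return released
}

func main() {
	phs := []*Placeholder{{Name: "tg-executor-0"}, {Name: "tg-executor-1"}}
	n := releasePlaceholdersOnDriverFailure(PodFailed, phs)
	fmt.Printf("released %d placeholders\n", n)
}
```

Note that this works whether the job came from the Spark operator or from spark-submit, since it keys only off the observed driver pod phase, not off any operator-owned resource.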
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]