[
https://issues.apache.org/jira/browse/YUNIKORN-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305472#comment-17305472
]
Chaoran Yu edited comment on YUNIKORN-588 at 3/21/21, 12:46 PM:
----------------------------------------------------------------
[~wwei] Yeah, I tried deleting the application CRD, and the placeholder pods
were gone afterwards. But oftentimes we want to leave the driver pod in place
so that Spark developers can inspect its logs to debug their jobs. So when the
driver pod errors out, all placeholder pods associated with that Spark job
should still be terminated immediately, rather than waiting until the timeout
expires, which could take a long time depending on the configuration.
was (Author: yuchaoran2011):
[~wwei] Yeah I tried deleting the driver pod and then the placeholder pods will
be gone. But often times we want to leave the driver pod there so that Spark
developers can look at the pod logs to debug their jobs. So when the driver pod
errors out, all placeholder pods associated with that Spark job should still be
immediately terminated, rather than waiting until after the timeout, which
could be long depending on the configuration
> Placeholder pods are not cleaned up timely when the Spark driver fails
> ----------------------------------------------------------------------
>
> Key: YUNIKORN-588
> URL: https://issues.apache.org/jira/browse/YUNIKORN-588
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Affects Versions: 0.10
> Reporter: Chaoran Yu
> Priority: Major
> Labels: spark
> Attachments: Screen Shot 2021-03-19 at 9.41.48 PM.png
>
>
> When a Spark job is gang scheduled, if the driver pod fails immediately upon
> running (e.g. due to an error in the Spark application code), the placeholder
> pods will still try to reserve resources. They won't be terminated until
> after the configured timeout has passed, even though they should have been
> cleaned up the moment the driver failed, because at that point we already
> know that none of the executors will have a chance to start.
> Something probably needs to be done at the Spark operator plugin level to
> trigger placeholder cleanup and release resources sooner.
> Edit: Actually, a fix needs to work without the Spark operator plugin,
> because the user might not be using it. The Spark job could well have been
> submitted via spark-submit.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)