[ https://issues.apache.org/jira/browse/YUNIKORN-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305472#comment-17305472 ]

Chaoran Yu edited comment on YUNIKORN-588 at 3/21/21, 12:46 PM:
----------------------------------------------------------------

[~wwei] Yeah, I tried deleting the application CRD, and the placeholder pods 
were then gone. But oftentimes we want to leave the driver pod in place so that 
Spark developers can look at its logs to debug their jobs. So when the driver 
pod errors out, all placeholder pods associated with that Spark job should 
still be terminated immediately, rather than waiting for the timeout to 
expire, which could take a long time depending on the configuration.
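For reference, the timeout in question is the gang-scheduling placeholder timeout, which is configured via an annotation on the originating (driver) pod's metadata at submission time. A minimal sketch of what that annotation looks like (the key and parameter name follow YuniKorn's gang-scheduling conventions; the value shown is purely illustrative):

```python
# Illustrative only: this annotation must already be on the originating
# (driver) pod's metadata when it is submitted; the timeout value is an
# example, not a recommendation.
placeholder_timeout_seconds = 300

driver_pod_annotations = {
    "yunikorn.apache.org/schedulingPolicyParameters":
        f"placeholderTimeoutInSeconds={placeholder_timeout_seconds}",
}
```

With a large value here, orphaned placeholders from a failed driver can linger for the full window, which is exactly the behavior described above.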


was (Author: yuchaoran2011):
[~wwei] Yeah, I tried deleting the driver pod, and the placeholder pods were 
then gone. But oftentimes we want to leave the driver pod in place so that 
Spark developers can look at its logs to debug their jobs. So when the driver 
pod errors out, all placeholder pods associated with that Spark job should 
still be terminated immediately, rather than waiting for the timeout to 
expire, which could take a long time depending on the configuration.

> Placeholder pods are not cleaned up timely when the Spark driver fails
> ----------------------------------------------------------------------
>
>                 Key: YUNIKORN-588
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-588
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>    Affects Versions: 0.10
>            Reporter: Chaoran Yu
>            Priority: Major
>              Labels: spark
>         Attachments: Screen Shot 2021-03-19 at 9.41.48 PM.png
>
>
> When a Spark job is gang scheduled and the driver pod fails immediately upon 
> running (e.g. due to an error in the Spark application code), the placeholder 
> pods still hold their reserved resources. They won't be terminated until the 
> configured timeout has passed, even though they should have been cleaned up 
> the moment the driver failed: at that point we already know that none of the 
> executors will ever get a chance to start.
>  Something probably needs to be done at the Spark operator plugin level to 
> trigger placeholder cleanup and release resources sooner.
> Edit: Actually, a fix needs to work without the Spark operator plugin as 
> well, because the user might not be using it. The Spark job could just as 
> well have been submitted via spark-submit.
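Until the shim handles this automatically, the manual workaround described in the comments can be sketched as a small script: on observing that the driver has failed, delete only the placeholder pods and leave the driver behind for log inspection. This is a sketch under stated assumptions — the `placeholder=true` label selector is assumed (check what your shim version actually sets), and it shells out to `kubectl` rather than calling YuniKorn itself:

```python
import subprocess

# Label assumed to be present on YuniKorn gang-scheduling placeholder pods;
# verify against your shim version before relying on it.
PLACEHOLDER_SELECTOR = "placeholder=true"


def driver_failed(phase: str) -> bool:
    """A driver pod in phase 'Failed' can never launch its executors,
    so its placeholders are dead weight."""
    return phase == "Failed"


def cleanup_cmd(namespace: str) -> list[str]:
    """Build the kubectl command that deletes the placeholder pods
    while leaving the driver pod (and its logs) untouched."""
    return ["kubectl", "delete", "pods",
            "-n", namespace, "-l", PLACEHOLDER_SELECTOR]


def cleanup_placeholders(namespace: str) -> None:
    # Deletes only the placeholders; the failed driver stays for debugging.
    subprocess.run(cleanup_cmd(namespace), check=True)
```

A watch loop (or an informer on the driver pod) would call `cleanup_placeholders` as soon as `driver_failed` returns true, instead of waiting out the placeholder timeout.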



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
