[
https://issues.apache.org/jira/browse/YUNIKORN-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305673#comment-17305673
]
Chaoran Yu edited comment on YUNIKORN-588 at 3/21/21, 12:46 PM:
----------------------------------------------------------------
I have experimented a bit more and found the following solution that works for
me. In the Spark operator app mgmt plugin, whenever the SparkApplication CRD
informer detects a change in the SparkApp state, we check whether the new state
is the {{Failed}} state. If so, we trigger a {{FailApplication}} event using
the app mgmt protocol. Note that currently the
{{ApplicationManagementProtocol}} interface lacks a {{NotifyApplicationFail}}
method, which needs to be added. Then in {{cache/application.go}}, we can add a
new event handler (e.g. {{onFail}}) that uses the placeholder manager to
initiate a placeholder cleanup for the given app. With such an implementation,
I have tested that as soon as the driver pod errors out, all placeholders are
immediately released.
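The flow described above can be sketched in Go. This is only an illustration: {{ApplicationManagementProtocol}}, {{NotifyApplicationFail}}, and {{onFail}} come from the proposal, but the request type, method signatures, and the placeholder-manager API here are simplified stand-ins, not the actual YuniKorn shim types.

```go
package main

import "fmt"

// FailApplicationRequest carries the ID of the app whose driver failed.
// (Hypothetical type; the real event payload may differ.)
type FailApplicationRequest struct {
	ApplicationID string
	Reason        string
}

// ApplicationManagementProtocol is a cut-down stand-in for the shim's
// interface, extended with the proposed NotifyApplicationFail method.
type ApplicationManagementProtocol interface {
	NotifyApplicationFail(request FailApplicationRequest)
}

// placeholderManager is a toy stand-in that records which apps had
// their placeholder pods cleaned up.
type placeholderManager struct {
	cleanedUp []string
}

func (m *placeholderManager) cleanUp(appID string) {
	m.cleanedUp = append(m.cleanedUp, appID)
}

// appContext plays the role of cache/application.go: onFail asks the
// placeholder manager to release the app's placeholder pods.
type appContext struct {
	placeholders *placeholderManager
}

func (c *appContext) NotifyApplicationFail(req FailApplicationRequest) {
	c.onFail(req)
}

func (c *appContext) onFail(req FailApplicationRequest) {
	c.placeholders.cleanUp(req.ApplicationID)
}

// onSparkAppUpdate mimics the SparkApplication informer callback: on a
// transition into the Failed state, fire the FailApplication event.
func onSparkAppUpdate(proto ApplicationManagementProtocol, appID, newState string) {
	if newState == "FAILED" {
		proto.NotifyApplicationFail(FailApplicationRequest{
			ApplicationID: appID,
			Reason:        "driver pod failed",
		})
	}
}

// runFailScenario drives the flow end to end and returns the apps
// whose placeholders were released.
func runFailScenario() []string {
	mgr := &placeholderManager{}
	ctx := &appContext{placeholders: mgr}
	onSparkAppUpdate(ctx, "spark-app-1", "RUNNING")
	onSparkAppUpdate(ctx, "spark-app-1", "FAILED")
	return mgr.cleanedUp
}

func main() {
	fmt.Println(runFailScenario()) // prints [spark-app-1]
}
```

The key design point is that cleanup is driven by the state transition itself rather than by the gang-scheduling timeout, so placeholders are released as soon as the failure is observed.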
This is a workaround for users of the Spark operator. For those who run Spark
jobs without it, one option is to implement logic that monitors the driver
pod status and reacts to it accordingly.
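For the spark-submit case, the monitoring logic could look roughly like the sketch below. In a real shim this would be a client-go pod informer filtered on the Spark driver label; here a channel of phase updates stands in for the watch stream, and all names are hypothetical.

```go
package main

import "fmt"

type podPhase string

const (
	podRunning podPhase = "Running"
	podFailed  podPhase = "Failed"
)

// podEvent is a simplified stand-in for a pod status update from a
// Kubernetes watch.
type podEvent struct {
	appID string
	phase podPhase
}

// watchDriverPods consumes phase updates and invokes cleanup as soon
// as a driver pod reports Failed, instead of waiting for the
// configured gang-scheduling timeout to expire.
func watchDriverPods(events <-chan podEvent, cleanup func(appID string)) {
	for ev := range events {
		if ev.phase == podFailed {
			cleanup(ev.appID)
		}
	}
}

// runWatchScenario feeds a Running-then-Failed sequence through the
// watcher and returns the apps that were cleaned up.
func runWatchScenario() []string {
	events := make(chan podEvent, 2)
	events <- podEvent{appID: "spark-app-1", phase: podRunning}
	events <- podEvent{appID: "spark-app-1", phase: podFailed}
	close(events)

	var cleaned []string
	watchDriverPods(events, func(appID string) {
		cleaned = append(cleaned, appID)
	})
	return cleaned
}

func main() {
	fmt.Println(runWatchScenario()) // prints [spark-app-1]
}
```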
> Placeholder pods are not cleaned up timely when the Spark driver fails
> ----------------------------------------------------------------------
>
> Key: YUNIKORN-588
> URL: https://issues.apache.org/jira/browse/YUNIKORN-588
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Affects Versions: 0.10
> Reporter: Chaoran Yu
> Priority: Major
> Labels: spark
> Attachments: Screen Shot 2021-03-19 at 9.41.48 PM.png
>
>
> When a Spark job is gang scheduled, if the driver pod fails immediately upon
> running (e.g. due to an error in the Spark application code), the placeholder
> pods will still try to reserve resources. They won't be terminated until
> after the configured timeout has passed, even though they should be
> cleaned up the moment the driver fails, since at that point we already
> know that none of the executors will have a chance to start.
> Something probably needs to be done at the Spark operator plugin level to
> activate placeholder cleanup to release resources sooner.
> Edit: Actually a fix needs to be developed without the Spark operator plugin
> because the user might not be using it. The Spark job could well have been
> submitted via spark-submit.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)