[
https://issues.apache.org/jira/browse/SPARK-53706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun reassigned SPARK-53706:
-------------------------------------
Assignee: Zhou JIANG
> App reconcile steps should properly handle exceptions in status update
> -----------------------------------------------------------------------
>
> Key: SPARK-53706
> URL: https://issues.apache.org/jira/browse/SPARK-53706
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: kubernetes-operator-0.5.0
> Reporter: Zhou JIANG
> Assignee: Zhou JIANG
> Priority: Major
> Labels: pull-request-available
>
> SparkAppReconciler does not handle exceptions thrown while updating app
> status. A failed reconcile loop may end up retrying endlessly if the status
> update fails because of a conflict.
>
> For example, we observe exceptions like the following when an app is starting:
>
> ```
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: PUT at:
> https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status.
> Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org
> "spark-example-retain-duration": the object has been modified; please apply
> your changes to the latest version and try again. Received status:
> Status(apiVersion=v1, code=409, details=StatusDetails(causes=[],
> group=spark.apache.org, kind=sparkapplications,
> name=spark-example-retain-duration, retryAfterSeconds=null, uid=null,
> additionalProperties={}), kind=Status, message=Operation cannot be fulfilled
> on sparkapplications.spark.apache.org "spark-example-retain-duration": the
> object has been modified; please apply your changes to the latest version and
> try again, metadata=ListMeta(_continue=null, remainingItemCount=null,
> resourceVersion=null, selfLink=null, additionalProperties={}),
> reason=Conflict, status=Failure, additionalProperties={}).
> at
> io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642)
> ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at
> io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622)
> ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at
> io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582)
> ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at
> io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
> ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
> ~[?:?]
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> ~[?:?]
> at
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
> ~[?:?]
> at
> io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141)
> ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> ```
>
> Why is this happening? Because a reconcile can be triggered again by a driver
> pod status update while another reconcile is in progress. Without proper
> exception handling, this keeps recurring.
> We should handle this more gracefully: if an exception is thrown while
> updating app status (typically the last step of each reconcile), the operator
> should finish this reconcile loop properly and start a new one. App status is
> fetched from cache at the beginning of each reconcile, and our reconcile
> steps are already designed to be idempotent.
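
The handling described above could be sketched roughly as follows. This is a minimal illustration only, not the operator's actual code: `ReconcileSketch`, `StatusUpdateException`, `StatusWriter`, and `Outcome` are all hypothetical names, and the exception class is a plain stand-in for a client exception that carries the HTTP status code (409 in the trace above). The sketch also assumes that only conflicts end the pass quietly while other failures still propagate; the issue text itself does not settle that detail.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReconcileSketch {

    /** Hypothetical stand-in for a client exception that carries an HTTP status code. */
    static class StatusUpdateException extends RuntimeException {
        final int code;

        StatusUpdateException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Hypothetical outcome of a single reconcile pass. */
    enum Outcome { DONE, REQUEUE }

    /** Hypothetical status writer; a real one would PUT to the /status subresource. */
    interface StatusWriter {
        void write(String newState);
    }

    /**
     * Runs the final status-update step of one reconcile pass.
     * A 409 conflict ends this pass cleanly and asks for a new one, which
     * re-reads the app status from cache. Any other failure is rethrown.
     */
    static Outcome reconcile(String newState, StatusWriter writer) {
        try {
            writer.write(newState);
            return Outcome.DONE;
        } catch (StatusUpdateException e) {
            if (e.code == 409) {
                // The object was modified concurrently: finish this pass and requeue.
                return Outcome.REQUEUE;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated writer: the first attempt hits a conflict, the second succeeds.
        StatusWriter flaky = state -> {
            if (calls.incrementAndGet() == 1) {
                throw new StatusUpdateException(409, "Conflict");
            }
        };
        // "SomeState" is a placeholder, not a real app state name.
        System.out.println(reconcile("SomeState", flaky)); // prints REQUEUE
        System.out.println(reconcile("SomeState", flaky)); // prints DONE
    }
}
```

Because the reconcile steps are idempotent, re-running the whole pass after a conflict is safe; the sketch relies on that property rather than retrying the write in place against a stale object.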