[ 
https://issues.apache.org/jira/browse/SPARK-53706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-53706:
-------------------------------------

    Assignee: Zhou JIANG

> App reconcile steps should properly handle exceptions in status update 
> -----------------------------------------------------------------------
>
>                 Key: SPARK-53706
>                 URL: https://issues.apache.org/jira/browse/SPARK-53706
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Kubernetes
>    Affects Versions: kubernetes-operator-0.5.0
>            Reporter: Zhou JIANG
>            Assignee: Zhou JIANG
>            Priority: Major
>              Labels: pull-request-available
>
> SparkAppReconciler does not handle exceptions thrown while updating app 
> status. A failed reconcile loop may then be retried endlessly if the status 
> update failed because of a conflict.
>  
> For example, we observe exceptions like the following while an app is starting:
>  
> ```
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: PUT at: 
> https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status.
>  Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org 
> "spark-example-retain-duration": the object has been modified; please apply 
> your changes to the latest version and try again. Received status: 
> Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], 
> group=spark.apache.org, kind=sparkapplications, 
> name=spark-example-retain-duration, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=Operation cannot be fulfilled 
> on sparkapplications.spark.apache.org "spark-example-retain-duration": the 
> object has been modified; please apply your changes to the latest version and 
> try again, metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=Conflict, status=Failure, additionalProperties={}).
> at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646) ~[?:?]
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
> at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
> at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
> ```
>  
> Why does this happen? A reconcile can be triggered again by a driver pod 
> status update while another reconcile is still in progress. Without proper 
> exception handling, the conflict keeps recurring.
> We should handle this more gracefully: if an exception is thrown while 
> updating app status (typically the last step of each reconcile), the operator 
> should finish the current reconcile loop cleanly and start a new one. This is 
> safe because app status is fetched from the cache at the beginning of each 
> reconcile, and our reconcile steps are already designed to be idempotent.
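
The proposed behaviour can be illustrated with a minimal, self-contained sketch. It is not the operator's actual code: `FakeStatusStore`, `ConflictException`, `Outcome`, and `reconcile` are hypothetical stand-ins, and the state names are plain illustrative strings. In the real operator the conflict surfaces as a `KubernetesClientException` with `code=409`, as in the trace above. The sketch only shows the idea of catching the conflict at the status-update step, ending the current loop, and letting the next loop work from a fresh read.

```java
public class ReconcileConflictSketch {

    /** Hypothetical stand-in for the 409 Conflict raised by the API server. */
    static class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory "API server" that enforces optimistic concurrency. */
    static class FakeStatusStore {
        private long resourceVersion = 1;
        private String state = "Submitted";

        synchronized long currentVersion() {
            return resourceVersion;
        }

        synchronized String currentState() {
            return state;
        }

        /** Rejects the write when the caller's view of the object is stale. */
        synchronized void updateStatus(long expectedVersion, String newState) {
            if (expectedVersion != resourceVersion) {
                throw new ConflictException("the object has been modified");
            }
            state = newState;
            resourceVersion++;
        }
    }

    /** Assumed result type: either the loop is done or a fresh loop is requested. */
    enum Outcome { DONE, REQUEUE }

    /**
     * One reconcile pass. The status update is the last step; a conflict is
     * caught here so the loop ends cleanly instead of propagating the exception.
     */
    static Outcome reconcile(FakeStatusStore store, long cachedVersion, String desiredState) {
        try {
            store.updateStatus(cachedVersion, desiredState);
            return Outcome.DONE;
        } catch (ConflictException e) {
            // The steps are idempotent, so it is safe to drop this attempt
            // and let the next reconcile start from a freshly read status.
            return Outcome.REQUEUE;
        }
    }

    public static void main(String[] args) {
        FakeStatusStore store = new FakeStatusStore();

        // This reconcile reads the status "from cache" ...
        long staleVersion = store.currentVersion();
        // ... and a concurrent reconcile updates the status first.
        store.updateStatus(staleVersion, "DriverRequested");

        // The stale write conflicts and is turned into a requeue.
        Outcome first = reconcile(store, staleVersion, "DriverStarted");
        // The next reconcile reads the latest version and succeeds.
        Outcome second = reconcile(store, store.currentVersion(), "DriverStarted");

        // prints "REQUEUE then DONE, state=DriverStarted"
        System.out.println(first + " then " + second + ", state=" + store.currentState());
    }
}
```

Returning an outcome instead of rethrowing keeps the conflict out of the generic error-retry path, which is the path that could otherwise keep failing with the same stale object.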



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

