Zhou JIANG created SPARK-53706:
----------------------------------

             Summary: App reconcile steps should properly handle exceptions in 
status update 
                 Key: SPARK-53706
                 URL: https://issues.apache.org/jira/browse/SPARK-53706
             Project: Spark
          Issue Type: Sub-task
          Components: Kubernetes
    Affects Versions: kubernetes-operator-0.5.0
            Reporter: Zhou JIANG


SparkAppReconciler does not handle exceptions when updating app status - a failed 
reconcile loop may end up retrying endlessly if the status update failure is 
caused by a conflict.

 

For example, we observe exceptions like the following when an app is starting:

 

```
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
executing: PUT at: 
https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status.
 Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org 
"spark-example-retain-duration": the object has been modified; please apply 
your changes to the latest version and try again. Received status: 
Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], 
group=spark.apache.org, kind=sparkapplications, 
name=spark-example-retain-duration, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on 
sparkapplications.spark.apache.org "spark-example-retain-duration": the object 
has been modified; please apply your changes to the latest version and try 
again, metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, 
status=Failure, additionalProperties={}).
at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
 ~[?:?]
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) 
~[?:?]
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) 
~[?:?]
at 
io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
```

 

Why does this happen? Reconcile can be triggered again by a driver pod status 
update while another reconcile is in progress, so the second update hits a stale 
resourceVersion. Without proper exception handling, the conflict keeps recurring.

We should handle this more gracefully: if an exception is thrown while updating 
app status (which typically happens at the end of each reconcile), the operator 
shall properly finish the current reconcile loop and start a new one. App status 
is fetched from cache at the beginning of each reconcile - and our reconcile 
steps are already designed to be idempotent.
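A minimal sketch of the intended handling (class and method names here are 
illustrative, not the operator's actual API, and a stand-in exception type is 
used in place of fabric8's KubernetesClientException): treat an HTTP 409 
conflict during the status update as a signal to end the current loop and 
schedule an immediate re-reconcile, which is safe because each reconcile 
re-fetches status from cache and the steps are idempotent.

```java
public class ConflictAwareUpdater {

  /** Outcome of one reconcile attempt (hypothetical name). */
  public enum ReconcileOutcome { COMPLETED, RESCHEDULE_IMMEDIATELY }

  /** Minimal stand-in for KubernetesClientException: carries the HTTP code. */
  public static class ApiException extends RuntimeException {
    final int code;
    public ApiException(int code, String message) {
      super(message);
      this.code = code;
    }
  }

  static boolean isConflict(ApiException e) {
    // 409 Conflict = Kubernetes optimistic-concurrency failure
    // ("the object has been modified; please apply your changes to the
    //  latest version and try again")
    return e.code == 409;
  }

  /**
   * Runs the status update; on a 409 conflict, finishes this reconcile
   * cleanly and requests an immediate new reconcile instead of rethrowing,
   * which would otherwise retry endlessly with the same stale version.
   */
  public static ReconcileOutcome updateStatus(Runnable statusUpdate) {
    try {
      statusUpdate.run();
      return ReconcileOutcome.COMPLETED;
    } catch (ApiException e) {
      if (isConflict(e)) {
        return ReconcileOutcome.RESCHEDULE_IMMEDIATELY;
      }
      throw e; // non-conflict failures still surface to the caller
    }
  }

  public static void main(String[] args) {
    // Simulated conflict: the resource was modified by a concurrent reconcile.
    ReconcileOutcome out = updateStatus(() -> {
      throw new ApiException(409, "the object has been modified");
    });
    System.out.println(out); // prints RESCHEDULE_IMMEDIATELY
  }
}
```

The key design choice is that the conflict is not retried in place: the next 
reconcile loop starts from a freshly cached app status, so it naturally picks 
up the latest resourceVersion.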



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
