Zhou JIANG created SPARK-53706:
----------------------------------
Summary: App reconcile steps should properly handle exceptions in
status update
Key: SPARK-53706
URL: https://issues.apache.org/jira/browse/SPARK-53706
Project: Spark
Issue Type: Sub-task
Components: Kubernetes
Affects Versions: kubernetes-operator-0.5.0
Reporter: Zhou JIANG
SparkAppReconciler does not handle exceptions thrown while updating app status - a
failed reconcile loop may end up in endless retries when the status update fails
due to a conflict.
For example, we observe exceptions like the following while an app is starting:
```
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status. Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=spark.apache.org, kind=sparkapplications, name=spark-example-retain-duration, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on sparkapplications.spark.apache.org "spark-example-retain-duration": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
	at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141) ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
```
Why is this happening? Reconcile can be triggered again by a driver pod status
update while another reconcile is still in progress, so the cached resource
version becomes stale. Without proper exception handling, the conflict keeps
recurring.
The operator should handle this more gracefully: if an exception is thrown while
updating app status (which typically happens at the end of each reconcile), the
operator shall finish the current reconcile loop cleanly and start a new one.
App status is fetched from cache at the beginning of each reconcile, and the
reconcile steps are already designed to be idempotent.
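A minimal sketch of the proposed handling. This is not the operator's actual code: ConflictException and updateStatus are hypothetical stand-ins for fabric8's KubernetesClientException (HTTP 409) and the status-update call at the end of a reconcile. The idea is simply to catch the conflict, end the current loop, and let the next loop re-fetch status from cache.

```java
public class ReconcileSketch {

    /** Stand-in for a KubernetesClientException carrying an HTTP code. */
    static class ConflictException extends RuntimeException {
        final int code;
        ConflictException(int code, String msg) { super(msg); this.code = code; }
    }

    enum Outcome { COMPLETED, REQUEUE }

    /**
     * Attempt the status update; on a 409 conflict, finish this reconcile
     * loop gracefully and requeue instead of retrying endlessly. The next
     * loop re-fetches app status from cache, and the steps are idempotent,
     * so dropping this loop is safe.
     */
    static Outcome updateStatusSafely(Runnable statusUpdate) {
        try {
            statusUpdate.run();
            return Outcome.COMPLETED;
        } catch (ConflictException e) {
            if (e.code == 409) {
                return Outcome.REQUEUE;
            }
            throw e; // non-conflict failures still surface
        }
    }

    public static void main(String[] args) {
        Outcome ok = updateStatusSafely(() -> { /* update succeeds */ });
        Outcome conflict = updateStatusSafely(() -> {
            throw new ConflictException(409, "the object has been modified");
        });
        System.out.println(ok + " " + conflict);
    }
}
```

With this shape, a conflicting update no longer bubbles up as a reconcile failure; it just ends the loop early and lets the requeued reconcile work from fresh cached state.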
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]