(spark-kubernetes-operator) branch main updated: [SPARK-55421] Increase `livenessProbe.failureThreshold` to 3

dongjoon Sat, 07 Feb 2026 18:57:43 -0800

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git



The following commit(s) were added to refs/heads/main by this push:
     new 7261b0a  [SPARK-55421] Increase `livenessProbe.failureThreshold` to 3
7261b0a is described below

commit 7261b0a179395e3ef1c7bf5e19b5d2b1356c21a8
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Sat Feb 7 18:57:27 2026 -0800

    [SPARK-55421] Increase `livenessProbe.failureThreshold` to 3
    
    ### What changes were proposed in this pull request?
    
    This PR aims to increase `livenessProbe.failureThreshold` to 3.
    
    ### Why are the changes needed?
    
    SPARK-54328 introduced a severe race condition that causes the `Apache Spark Operator` to restart too frequently. By design, the watch `WebSocket` is closed and re-established regularly, and each reconnect is scheduled 1000 ms after the close. When `HealthProbe` runs during this reconnection window, it sees the informer as not watching and reports it as unhealthy. The operator is then killed because SPARK-54328 used `failureThreshold=1`, which means a single failed probe is enough to trigger a restart. The full log is shown below.
    
    - https://github.com/apache/spark-kubernetes-operator/pull/417
    
    **Spark Operator Restart Log**
    ```
    26/02/08 01:59:42 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
    26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
HEALTHY for type: SparkApplication, namespace: default, details [is running: 
true, has synced: true, is watching: true]
    26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
HEALTHY for type: Pod, namespace: default, details [is running: true, has 
synced: true, is watching: true]
    26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
HEALTHY for type: SparkCluster, namespace: default, details [is running: true, 
has synced: true, is watching: true]
    26/02/08 01:59:42 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
HEALTHY for type: Pod, namespace: default, details [is running: true, has 
synced: true, is watching: true]
    26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the 
current watch
    26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.WatcherWebSocketListener WebSocket 
close received. code: 1000, reason: null
    26/02/08 01:59:51 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling 
reconnect task in 1000 ms
    26/02/08 01:59:52 DEBUG   o.a.s.k.o.p.HealthProbe Checking informer health
    26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
HEALTHY for type: SparkApplication, namespace: default, details [is running: 
true, has synced: true, is watching: true]
    26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
HEALTHY for type: Pod, namespace: default, details [is running: true, has 
synced: true, is watching: true]
    26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
UNHEALTHY for type: SparkCluster, namespace: default, details [is running: 
true, has synced: true, is watching: false]
    26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
HEALTHY for type: Pod, namespace: default, details [is running: true, has 
synced: true, is watching: true]
    26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
UNHEALTHY for type: SparkCluster, namespace: default, details [is running: 
true, has synced: true, is watching: false]
    26/02/08 01:59:52 DEBUG   i.j.o.p.e.s.i.InformerWrapper Informer status: 
UNHEALTHY for type: SparkCluster, namespace: default, details [is running: 
true, has synced: true, is watching: false]
    26/02/08 01:59:52 ERROR   o.a.s.k.o.p.HealthProbe Controller: 
sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer: 
UNHEALTHY is in default, not a healthy state
    ```
    
    **AbstractWatchManager Behavior**
    ```
    $ k logs -f spark-kubernetes-operator-68c55d48d9-548mz| grep 
'AbstractWatchManager'
    26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=276933&timeoutSeconds=600&watch=true...
    26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=276931&timeoutSeconds=600&watch=true...
    26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
    26/02/08 01:59:55 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
    26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the 
current watch
    26/02/08 02:06:23 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling 
reconnect task in 1000 ms
    26/02/08 02:06:24 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=277053&timeoutSeconds=600&watch=true...
    26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the 
current watch
    26/02/08 02:07:14 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling 
reconnect task in 1000 ms
    26/02/08 02:07:15 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
    26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the 
current watch
    26/02/08 02:07:33 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling 
reconnect task in 1000 ms
    26/02/08 02:07:34 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
    26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Closing the 
current watch
    26/02/08 02:08:41 DEBUG   i.f.k.c.d.i.AbstractWatchManager Scheduling 
reconnect task in 1000 ms
    26/02/08 02:08:42 DEBUG   i.f.k.c.d.i.AbstractWatchManager Watching 
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=277089&timeoutSeconds=600&watch=true...
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    This fixes the regression in v0.7.0.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: `Opus 4.5` on `Claude Code`
    
    Closes #491 from dongjoon-hyun/SPARK-55421.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl | 2 +-
 build-tools/helm/spark-kubernetes-operator/values.yaml            | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl 
b/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl
index a24ba57..5891d6d 100644
--- a/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl
+++ b/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl
@@ -147,7 +147,7 @@ Liveness Probe properties override
 {{- default 10 
.Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.periodSeconds
 }}
 {{- end }}
 {{- define "spark-operator.livenessProbe.failureThreshold" -}}
-{{- default 1 
.Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.failureThreshold
 }}
+{{- default 3 
.Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.failureThreshold
 }}
 {{- end }}
 {{- define "spark-operator.livenessProbe.timeoutSeconds" -}}
 {{- default 1 
.Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.timeoutSeconds
 }}
diff --git a/build-tools/helm/spark-kubernetes-operator/values.yaml 
b/build-tools/helm/spark-kubernetes-operator/values.yaml
index b0e948d..1b528e6 100644
--- a/build-tools/helm/spark-kubernetes-operator/values.yaml
+++ b/build-tools/helm/spark-kubernetes-operator/values.yaml
@@ -61,7 +61,7 @@ operatorDeployment:
         livenessProbe:
           periodSeconds: 10
           initialDelaySeconds: 30
-          failureThreshold: 1
+          failureThreshold: 3
           timeoutSeconds: 1
         startupProbe:
           failureThreshold: 30
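
The effect of the `failureThreshold` change can be illustrated with a minimal Python sketch. It is not part of the commit. It assumes the usual liveness-probe semantics, where the container is restarted after `failureThreshold` consecutive failed probes and a successful probe resets the counter. The function name `first_restart_index` and the probe sequence are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional


def first_restart_index(probe_results: Iterable[bool],
                        failure_threshold: int) -> Optional[int]:
    """Return the index of the probe that triggers a restart, or None.

    Assumption: a restart happens after `failure_threshold` consecutive
    failed probes, and any successful probe resets the failure counter.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        if healthy:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
    return None


# Hypothetical sequence: exactly one probe lands inside the ~1000 ms
# reconnect window and sees the informer as unhealthy.
probes = [True, True, False, True, True]

print(first_restart_index(probes, failure_threshold=1))  # 2 (restart)
print(first_restart_index(probes, failure_threshold=3))  # None (no restart)
```

Under these assumptions, with `periodSeconds: 10` and `failureThreshold: 3`, the informer would have to stay unhealthy across three consecutive probes, roughly 20 seconds or more, before a restart is triggered. That is far longer than the 1000 ms reconnect delay seen in the logs.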

