This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/spark-kubernetes-operator.git
The following commit(s) were added to refs/heads/main by this push:
new 7261b0a [SPARK-55421] Increase `livenessProbe.failureThreshold` to 3
7261b0a is described below
commit 7261b0a179395e3ef1c7bf5e19b5d2b1356c21a8
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Sat Feb 7 18:57:27 2026 -0800
[SPARK-55421] Increase `livenessProbe.failureThreshold` to 3
### What changes were proposed in this pull request?
This PR aims to increase `livenessProbe.failureThreshold` to 3.
### Why are the changes needed?
SPARK-54328 introduced a severe race condition that causes the `Apache
Spark Operator` to restart too frequently. The `WebSocket` watch is expected
to close and reconnect regularly, and each reconnect is scheduled 1000 ms
after the close. When `HealthProbe` runs during this reconnection window, it
reports the informer as unhealthy and the operator is killed, because
SPARK-54328 set `failureThreshold=1`, which tolerates no failed probe at all.
Here is the full log.
- https://github.com/apache/spark-kubernetes-operator/pull/417
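
The effect of the threshold can be illustrated with a small simulation. This is only a sketch of the documented Kubernetes rule that a container is restarted after `failureThreshold` consecutive probe failures. The function names and the two-second unhealthy window (`:51` to `:53`) are assumptions chosen to match the timestamps in the restart log, not code from the operator or the kubelet.

```python
def restarts_container(probe_results, failure_threshold):
    """Return True if a run of consecutive failed probes reaches the threshold."""
    consecutive = 0
    for healthy in probe_results:
        if healthy:
            consecutive = 0  # any success resets the failure counter
        else:
            consecutive += 1
            if consecutive >= failure_threshold:
                return True
    return False


def sample_probes(period_s, first_probe_s, last_probe_s, unhealthy_from_s, unhealthy_until_s):
    """Probe every period_s seconds; the target is unhealthy in [from, until)."""
    results = []
    t = first_probe_s
    while t <= last_probe_s:
        results.append(not (unhealthy_from_s <= t < unhealthy_until_s))
        t += period_s
    return results


# Probes at :42, :52, :62, ... (periodSeconds=10); the watch closes at :51 and
# is assumed to be reconnected by :53 (hypothetical window for illustration).
results = sample_probes(period_s=10, first_probe_s=42, last_probe_s=82,
                        unhealthy_from_s=51, unhealthy_until_s=53)

print(restarts_container(results, failure_threshold=1))
print(restarts_container(results, failure_threshold=3))
```

Under these assumptions a single probe lands inside the reconnect window, which is enough to trigger a restart with a threshold of 1 but not with a threshold of 3.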
**Spark Operator Restart Log**
```
26/02/08 01:59:42 DEBUG o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
HEALTHY for type: SparkApplication, namespace: default, details [is running:
true, has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
HEALTHY for type: Pod, namespace: default, details [is running: true, has
synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
HEALTHY for type: SparkCluster, namespace: default, details [is running: true,
has synced: true, is watching: true]
26/02/08 01:59:42 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
HEALTHY for type: Pod, namespace: default, details [is running: true, has
synced: true, is watching: true]
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the
current watch
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.WatcherWebSocketListener WebSocket
close received. code: 1000, reason: null
26/02/08 01:59:51 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling
reconnect task in 1000 ms
26/02/08 01:59:52 DEBUG o.a.s.k.o.p.HealthProbe Checking informer health
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
HEALTHY for type: SparkApplication, namespace: default, details [is running:
true, has synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
HEALTHY for type: Pod, namespace: default, details [is running: true, has
synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
UNHEALTHY for type: SparkCluster, namespace: default, details [is running:
true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
HEALTHY for type: Pod, namespace: default, details [is running: true, has
synced: true, is watching: true]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
UNHEALTHY for type: SparkCluster, namespace: default, details [is running:
true, has synced: true, is watching: false]
26/02/08 01:59:52 DEBUG i.j.o.p.e.s.i.InformerWrapper Informer status:
UNHEALTHY for type: SparkCluster, namespace: default, details [is running:
true, has synced: true, is watching: false]
26/02/08 01:59:52 ERROR o.a.s.k.o.p.HealthProbe Controller:
sparkclusterreconciler, Event Source: ControllerResourceEventSource, Informer:
UNHEALTHY is in default, not a healthy state
```
**AbstractWatchManager Behavior**
```
$ k logs -f spark-kubernetes-operator-68c55d48d9-548mz | grep 'AbstractWatchManager'
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=276931&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 01:59:55 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=276933&timeoutSeconds=600&watch=true...
26/02/08 02:06:23 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the
current watch
26/02/08 02:06:23 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling
reconnect task in 1000 ms
26/02/08 02:06:24 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkclusters?allowWatchBookmarks=true&resourceVersion=277053&timeoutSeconds=600&watch=true...
26/02/08 02:07:14 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the
current watch
26/02/08 02:07:14 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling
reconnect task in 1000 ms
26/02/08 02:07:15 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:07:33 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the
current watch
26/02/08 02:07:33 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling
reconnect task in 1000 ms
26/02/08 02:07:34 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/api/v1/namespaces/default/pods?allowWatchBookmarks=true&labelSelector=spark.operator%2Fname%3Dspark-kubernetes-operator&resourceVersion=277072&timeoutSeconds=600&watch=true...
26/02/08 02:08:41 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the
current watch
26/02/08 02:08:41 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling
reconnect task in 1000 ms
26/02/08 02:08:42 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching
https://10.43.0.1:443/apis/spark.apache.org/v1/namespaces/default/sparkapplications?allowWatchBookmarks=true&resourceVersion=277089&timeoutSeconds=600&watch=true...
```
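
The reconnect gap can be read off the log mechanically. The following is a minimal sketch, assuming the `yy/MM/dd HH:mm:ss` timestamp prefix seen above and log lines that have been joined back onto a single line each. The helper name `reconnect_gaps` is hypothetical, and the sample URLs are abbreviated.

```python
from datetime import datetime

# Lines taken from the log above, unwrapped; URLs abbreviated with "...".
SAMPLE = [
    "26/02/08 02:06:23 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch",
    "26/02/08 02:06:23 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms",
    "26/02/08 02:06:24 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/...",
    "26/02/08 02:07:14 DEBUG i.f.k.c.d.i.AbstractWatchManager Closing the current watch",
    "26/02/08 02:07:14 DEBUG i.f.k.c.d.i.AbstractWatchManager Scheduling reconnect task in 1000 ms",
    "26/02/08 02:07:15 DEBUG i.f.k.c.d.i.AbstractWatchManager Watching https://10.43.0.1:443/...",
]


def reconnect_gaps(lines):
    """Seconds between each 'Scheduling reconnect' line and the next 'Watching' line."""
    gaps = []
    scheduled_at = None
    for line in lines:
        # The first 17 characters hold the timestamp, e.g. "26/02/08 02:06:23".
        ts = datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
        if "Scheduling reconnect task" in line:
            scheduled_at = ts
        elif " Watching " in line and scheduled_at is not None:
            gaps.append((ts - scheduled_at).total_seconds())
            scheduled_at = None
    return gaps


print(reconnect_gaps(SAMPLE))
```

Each gap is roughly one second at the log's one-second resolution, which is the window in which a liveness probe can observe `is watching: false`.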
### Does this PR introduce _any_ user-facing change?
Yes. This fixes the frequent-restart regression in v0.7.0.
### How was this patch tested?
Pass the CIs.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: `Opus 4.5` on `Claude Code`
Closes #491 from dongjoon-hyun/SPARK-55421.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl | 2 +-
build-tools/helm/spark-kubernetes-operator/values.yaml | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl b/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl
index a24ba57..5891d6d 100644
--- a/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl
+++ b/build-tools/helm/spark-kubernetes-operator/templates/_helpers.tpl
@@ -147,7 +147,7 @@ Liveness Probe properties override
{{- default 10 .Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.periodSeconds }}
{{- end }}
{{- define "spark-operator.livenessProbe.failureThreshold" -}}
-{{- default 1 .Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.failureThreshold }}
+{{- default 3 .Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.failureThreshold }}
{{- end }}
{{- define "spark-operator.livenessProbe.timeoutSeconds" -}}
{{- default 1 .Values.operatorDeployment.operatorPod.operatorContainer.probes.livenessProbe.timeoutSeconds }}
diff --git a/build-tools/helm/spark-kubernetes-operator/values.yaml b/build-tools/helm/spark-kubernetes-operator/values.yaml
index b0e948d..1b528e6 100644
--- a/build-tools/helm/spark-kubernetes-operator/values.yaml
+++ b/build-tools/helm/spark-kubernetes-operator/values.yaml
@@ -61,7 +61,7 @@ operatorDeployment:
livenessProbe:
periodSeconds: 10
initialDelaySeconds: 30
- failureThreshold: 1
+ failureThreshold: 3
timeoutSeconds: 1
startupProbe:
failureThreshold: 30