Rocco Verhoef created SPARK-50337:
-------------------------------------
Summary: Error logged due to race condition when shutting down
kubernetes client
Key: SPARK-50337
URL: https://issues.apache.org/jira/browse/SPARK-50337
Project: Spark
Issue Type: Bug
Components: Kubernetes, Spark Core
Affects Versions: 3.5.3
Reporter: Rocco Verhoef
While running Spark 3.5.3 on Kubernetes, we sometimes get the error message
below in our logs.
Looking through the code, we start two components to validate driver/executor
state in the k8s cluster. One is `ExecutorPodsWatchSnapshotSource`, which
watches all events coming from k8s and is fine. The other is
`ExecutorPodsPollingSnapshotSource`, which periodically pulls the snapshot
state from k8s. It is likely there in case the event watcher gets disconnected
for a period (a standard pattern).
Class `ExecutorPodsPollingSnapshotSource` has a thread that runs periodically
in the background to fetch the snapshot. On shutdown, it interrupts the future.
Depending on where the future is at that moment, this leads to different types
of exceptions being raised within that thread. If it raises a bare
`InterruptedException`, then `Utils.tryLogNonFatalError` swallows it and
doesn't print an error. But if, as in the trace below, the
`InterruptedException` gets wrapped in another exception, the error below is
printed.
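To illustrate the wrapping, here is a minimal, self-contained sketch in plain
Java. It does not use Spark or fabric8 code: `InterruptWrappingDemo`,
`blockingList` and `runAndInterrupt` are hypothetical names, and the wrapping
in `blockingList` is only an assumption modelled on the cause chain visible in
the trace (outer exception, then `InterruptedIOException`, then
`InterruptedException`).

```java
import java.io.InterruptedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicReference;

final class InterruptWrappingDemo {

    // Hypothetical stand-in for a blocking client call: waits on a future and,
    // if interrupted, rethrows the interruption wrapped in two outer exceptions.
    static String blockingList(CompletableFuture<String> pending) {
        try {
            return pending.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            InterruptedIOException io = new InterruptedIOException();
            io.initCause(e);
            throw new RuntimeException("Operation: [list] failed.", io);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }

    // Starts a poller thread, interrupts it while it waits on a future that
    // never completes, and returns the exception the poller observed.
    static Throwable runAndInterrupt() {
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CountDownLatch started = new CountDownLatch(1);
        Thread poller = new Thread(() -> {
            started.countDown();
            try {
                blockingList(neverCompletes);
            } catch (Throwable t) {
                seen.set(t);
            }
        }, "polling-sync-demo");
        poller.start();
        try {
            started.await();
            poller.interrupt(); // stands in for shutdown interrupting the poll
            poller.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo itself was interrupted", e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        // Print each exception type in the cause chain, outermost first.
        for (Throwable t = runAndInterrupt(); t != null; t = t.getCause()) {
            System.out.println(t.getClass().getName());
        }
    }
}
```

The outermost type seen by the caller is not `InterruptedException`, so a
handler that only looks at the top-level exception treats the interruption
like any other failure.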
I guess the solution should go in the direction of checking the root cause of
the exception, rather than only its outermost type.
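A rough sketch of what such a check could look like, in plain Java. This is
not Spark's code: `InterruptCauseCheck`, `causedByInterrupt` and
`runQuietlyOnInterrupt` are hypothetical names, and the depth limit is an
arbitrary assumption. The check deliberately matches only
`InterruptedException`, since `InterruptedIOException` on its own can also
indicate an ordinary I/O timeout.

```java
import java.io.InterruptedIOException;

final class InterruptCauseCheck {

    // Assumed upper bound on how far getCause() is followed, so that a
    // cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    // Returns true if t, or any exception in its cause chain, is an
    // InterruptedException.
    static boolean causedByInterrupt(Throwable t) {
        Throwable cur = t;
        for (int depth = 0; cur != null && depth < MAX_DEPTH; depth++) {
            if (cur instanceof InterruptedException) {
                return true;
            }
            cur = cur.getCause();
        }
        return false;
    }

    // Hypothetical wrapper: runs the block and reports what happened.
    // Interrupt-caused failures are suppressed, other failures are logged.
    static String runQuietlyOnInterrupt(Runnable block) {
        try {
            block.run();
            return "ok";
        } catch (RuntimeException e) {
            if (causedByInterrupt(e)) {
                return "suppressed";
            }
            System.err.println("Uncaught exception in thread "
                + Thread.currentThread().getName() + ": " + e);
            return "logged";
        }
    }

    public static void main(String[] args) {
        // Same shape as the reported trace: wrapper -> InterruptedIOException
        // -> InterruptedException.
        Throwable io = new InterruptedIOException().initCause(new InterruptedException());
        RuntimeException wrapped = new RuntimeException("list failed", io);
        System.out.println(runQuietlyOnInterrupt(() -> { throw wrapped; }));
        System.out.println(runQuietlyOnInterrupt(() -> {
            throw new RuntimeException("real failure");
        }));
    }
}
```

With this shape, the wrapped interruption from a shutdown is reported as
"suppressed", while an unrelated failure is still logged.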
```
WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
ERROR Utils: Uncaught exception in thread kubernetes-executor-pod-polling-sync
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list] for kind: [Pod] with name: [null] in namespace: [test-namespace] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:422)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:388)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:92)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsPollingSnapshotSource$PollRunnable.$anonfun$run$1(ExecutorPodsPollingSnapshotSource.scala:92)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1375)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsPollingSnapshotSource$PollRunnable.run(ExecutorPodsPollingSnapshotSource.scala:75)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.InterruptedIOException
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:505)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:420)
... 11 more
Caused by: java.lang.InterruptedException
at java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:502)
... 12 more
```