Rocco Verhoef created SPARK-50337:
-------------------------------------

             Summary: Error logged due to race condition when shutting down 
kubernetes client
                 Key: SPARK-50337
                 URL: https://issues.apache.org/jira/browse/SPARK-50337
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Spark Core
    Affects Versions: 3.5.3
            Reporter: Rocco Verhoef


While running Spark 3.5.3 on Kubernetes, we sometimes get the error message 
below in our logs.

Looking through the code, we start two components to validate driver/executor 
state in the Kubernetes cluster. One is `ExecutorPodsWatchSnapshotSource`, which 
watches all events coming from Kubernetes and is fine. The other is 
`ExecutorPodsPollingSnapshotSource`, which periodically pulls the snapshot 
state from Kubernetes, likely as a fallback in case the event watcher gets 
disconnected for a period (standard pattern).

Class `ExecutorPodsPollingSnapshotSource` has a thread running periodically in 
the background to fetch the snapshot. On shutdown, it interrupts the future. 
Depending on where the future is at that moment, this leads to different types 
of exceptions being raised within that thread. If it raises a bare 
`InterruptedException`, then `Utils.tryLogNonFatalError` does not print an 
error. But if, as in the trace below, the `InterruptedException` gets wrapped in 
another exception (here a `KubernetesClientException`), then it prints the 
error below.

I guess the solution should be in the direction of checking the root cause of 
the exception.
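
To illustrate what I mean by checking the root cause, here is a minimal sketch 
in plain Java (not Spark code). The class `InterruptCause` and the method 
`isCausedByInterrupt` are hypothetical names made up for this example, and the 
depth limit of 32 is an arbitrary assumption. It only shows how a wrapped 
interrupt could be detected by walking the cause chain; where such a check 
would belong in Spark is left open.

```java
import java.io.InterruptedIOException;

// Hypothetical helper, not an existing Spark or fabric8 API.
final class InterruptCause {
    private InterruptCause() {}

    // Arbitrary bound (assumption) so a malformed, cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    /**
     * Returns true if the given throwable, or any throwable in its cause chain,
     * is an InterruptedException or an InterruptedIOException.
     */
    static boolean isCausedByInterrupt(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < MAX_DEPTH; cur = cur.getCause()) {
            if (cur instanceof InterruptedException || cur instanceof InterruptedIOException) {
                return true;
            }
            depth++;
        }
        return false;
    }
}
```

With a check like this, an exception shaped like the one in the trace below (an 
outer client exception whose cause chain ends in `InterruptedException`) would 
be recognised as an interrupt during shutdown, while unrelated failures would 
still be reported as errors.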

 

```
WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
ERROR Utils: Uncaught exception in thread kubernetes-executor-pod-polling-sync
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list] for kind: [Pod] with name: [null] in namespace: [test-namespace] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:422)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:388)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:92)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsPollingSnapshotSource$PollRunnable.$anonfun$run$1(ExecutorPodsPollingSnapshotSource.scala:92)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1375)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsPollingSnapshotSource$PollRunnable.run(ExecutorPodsPollingSnapshotSource.scala:75)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.io.InterruptedIOException
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:505)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:420)
... 11 more
Caused by: java.lang.InterruptedException
at java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)
at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:502)
... 12 more
```



