[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685897#comment-17685897 ]
Hussein Awala commented on SPARK-34645:
---------------------------------------

I am facing a similar problem with Spark 3.2.1 and JDK 8. I'm running the jobs in client mode on arm64 nodes, and in about 10% of these jobs, after the executor pods and the created PVCs are deleted, the driver pod gets stuck in the Running state with this log:
{code:java}
23/02/08 13:04:38 INFO SparkUI: Stopped Spark web UI at http://172.17.45.51:4040
23/02/08 13:04:38 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
23/02/08 13:04:38 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
23/02/08 13:04:38 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
23/02/08 13:04:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/02/08 13:04:39 INFO MemoryStore: MemoryStore cleared
23/02/08 13:04:39 INFO BlockManager: BlockManager stopped
23/02/08 13:04:39 INFO BlockManagerMaster: BlockManagerMaster stopped
23/02/08 13:04:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/02/08 13:04:39 INFO SparkContext: Successfully stopped SparkContext
{code}
JDK:
{code:java}
root@***:/# java -version | tail -n3
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (Temurin)(build 25.362-b09, mixed mode)
{code}
I tried:
* running without the conf _spark.kubernetes.driver.reusePersistentVolumeClaim_ and without the PVC at all
* applying the patch [https://github.com/apache/spark/commit/457b75ea2bca6b5811d61ce9f1d28c94b0dde3a2] proposed by [~mickayg] on Spark 3.2.1
* upgrading to 3.2.3

but I still have the same problem. I didn't find any relevant fix in the Spark 3.3.0 and 3.3.1 release notes except upgrading the Kubernetes client. Do you have some tips for investigating the issue? (See the thread-dump sketch after the quoted issue below.)

> [K8S] Driver pod stuck in Running state after job completes
> ------------------------------------------------------------
>
>                 Key: SPARK-34645
>                 URL: https://issues.apache.org/jira/browse/SPARK-34645
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.0.2
>         Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
> {code}
>            Reporter: Andy Grove
>            Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1, everything works as expected and I see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2, I do not see the context get shut down and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1; it does not appear with 3.0.2. With 3.0.2 there is no output at all after the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting down all executors
> 2021-03-05 20:09:24,600 INFO k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
> 2021-03-05 20:09:24,752 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
> {code}
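On the investigation question above: when the driver pod stays Running even though the log ends with "Successfully stopped SparkContext", the usual cause is a non-daemon thread keeping the driver JVM alive after the context shuts down. {{kubectl exec <driver-pod> -- jstack 1}} shows those threads without any code change; alternatively, a minimal sketch like the one below (an illustration only; the class and method names are not part of Spark) can be called at the end of the driver's main method to print every surviving non-daemon thread:
{code:java}
import java.util.Map;

/**
 * Illustrative helper (not part of Spark): list the non-daemon threads
 * that can keep a driver JVM alive after SparkContext.stop() returns.
 */
public final class NonDaemonThreadDump {

    public static void dump() {
        for (Map.Entry<Thread, StackTraceElement[]> entry :
                Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (thread.isAlive() && !thread.isDaemon()) {
                System.err.println("Non-daemon thread still alive: " + thread.getName());
                for (StackTraceElement frame : entry.getValue()) {
                    System.err.println("    at " + frame);
                }
            }
        }
    }

    public static void main(String[] args) {
        // In a driver you would call dump() right after spark.stop();
        // run standalone it only reports this JVM's own main thread.
        dump();
    }
}
{code}
If the surviving thread belongs to the Kubernetes client stack (for example an OkHttp dispatcher or a watch thread), that would point at the client shutdown path rather than at the scheduler itself, which would be consistent with the {{ExecutorPodsWatchSnapshotSource}} warning in the logs above.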