silent-night-no-trace opened a new issue, #17521:
URL: https://github.com/apache/dolphinscheduler/issues/17521
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar issues.

### What happened

I use DolphinScheduler (DS) to schedule Spark tasks in spark-on-k8s mode. When the pod runs for long enough, the DS task instance page shows the following error in the task log:

```log
191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:54.320 +0800 - -> 25/09/12 14:20:53 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:55.320 +0800 - -> 25/09/12 14:20:54 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:56.321 +0800 - -> 25/09/12 14:20:55 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:57.322 +0800 - -> 25/09/12 14:20:56 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:58.323 +0800 - -> 25/09/12 14:20:57 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:20:59.324 +0800 - -> 25/09/12 14:20:58 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
	 namespace: ds
	 labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
	 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
	 creation time: 2025-09-12T06:10:40Z
	 service account name: spark-driver-sa
	 volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
	 node name: kubesphere-node-7
	 start time: 2025-09-12T06:10:40Z
	 phase: Running
	 container status:
		 container name: spark-kubernetes-driver
		 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
		 container state: terminated
		 container started at: 2025-09-12T06:10:42Z
		 container finished at: 2025-09-12T06:20:58Z
		 exit code: 0
		 termination reason: Completed
[INFO] 2025-09-12 14:21:00.325 +0800 - -> 25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
[INFO] 2025-09-12 14:21:01.325 +0800 - -> 25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
	 namespace: ds
	 labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
	 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
	 creation time: 2025-09-12T06:10:40Z
	 service account name: spark-driver-sa
	 volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
	 node name: kubesphere-node-7
	 start time: 2025-09-12T06:10:40Z
	 phase: Running
	 container status:
		 container name: spark-kubernetes-driver
		 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
		 container state: terminated
		 container started at: 2025-09-12T06:10:42Z
		 container finished at: 2025-09-12T06:20:58Z
		 exit code: 0
		 termination reason: Completed
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
	 namespace: ds
	 labels: dolphinscheduler-label -> 103282_91946, spark-app-name -> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector -> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version -> 3.5.5
	 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
	 creation time: 2025-09-12T06:10:40Z
	 service account name: spark-driver-sa
	 volumes: spark-history-logs-pvc, hadoop-properties, spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
	 node name: kubesphere-node-7
	 start time: 2025-09-12T06:10:40Z
	 phase: Succeeded
	 container status:
		 container name: spark-kubernetes-driver
		 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
		 container state: terminated
		 container started at: 2025-09-12T06:10:42Z
		 container finished at: 2025-09-12T06:20:58Z
		 exit code: 0
		 termination reason: Completed
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status for spark-191892f6b9e44020aea1005406d009b2 (phase: Succeeded)
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Container final statuses:
	 container name: spark-kubernetes-driver
	 container image: harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
	 container state: terminated
	 container started at: 2025-09-12T06:10:42Z
	 container finished at: 2025-09-12T06:20:58Z
	 exit code: 0
	 termination reason: Completed
25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application com.analysis.job.UserCleanupMergeBatchJob with application ID spark-191892f6b9e44020aea1005406d009b2 and submission ID ds:com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver finished
25/09/12 14:21:00 INFO ShutdownHookManager: Shutdown hook called
25/09/12 14:21:00 INFO ShutdownHookManager: Deleting directory /tmp/spark-1c12b6c0-f57a-4bf6-9321-336f6045ba05
[ERROR] 2025-09-12 14:21:01.327 +0800 - Handle pod log error
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.RuntimeException: The driver pod does not exist.
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.run(AbstractCommandExecutor.java:182)
	at org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.handle(AbstractYarnTask.java:53)
	at org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerTaskExecutor.executeTask(DefaultWorkerTaskExecutor.java:51)
	at org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecutor.run(WorkerTaskExecutor.java:172)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The driver pod does not exist.
	at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:254)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	... 3 common frames omitted
Caused by: java.lang.RuntimeException: The driver pod does not exist.
	at org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:244)
	... 7 common frames omitted
[INFO] 2025-09-12 14:21:01.328 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946, processId:49917 ,exitStatusCode:0 ,processWaitForStatus:true ,processExitValue:0
[INFO] 2025-09-12 14:21:01.328 +0800 - Start finding appId in /opt/dolphinscheduler/logs/20250912/151590881017889/8/103282/91946.log, fetch way: log
[INFO] 2025-09-12 14:21:01.330 +0800 - ***********************************************************************************************
[INFO] 2025-09-12 14:21:01.330 +0800 - ********************************* Finalize task instance ************************************
[INFO] 2025-09-12 14:21:01.330 +0800 - ***********************************************************************************************
[INFO] 2025-09-12 14:21:01.331 +0800 - Upload output files: [] successfully
[INFO] 2025-09-12 14:21:01.333 +0800 - Send task execute status: SUCCESS to master : dolphinscheduler-worker-1.dolphinscheduler-worker-headless:1234
[INFO] 2025-09-12 14:21:01.333 +0800 - Remove the current task execute context from worker cache
[INFO] 2025-09-12 14:21:01.334 +0800 - The current execute mode isn't develop mode, will clear the task execute file: /tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
[INFO] 2025-09-12 14:21:01.347 +0800 - Success clear the task execute file: /tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
[INFO] 2025-09-12 14:21:01.347 +0800 - FINALIZE_SESSION
```

### What you expected to happen

Each task instance on the DS task instance page can pull its logs normally.

### How to reproduce

Use DS to schedule a Spark task (spark-on-k8s mode) whose execution time exceeds 10 minutes.

### Anything else

_No response_

### Version

3.2.x

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
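For context, the stack trace points at `AbstractCommandExecutor.collectPodLogIfNeeded`, which polls the driver pod on a schedule. The log shows the pod reaching `phase: Succeeded` and `spark-submit` exiting with `exitStatusCode:0`, yet the next polling tick fails with "The driver pod does not exist." The sketch below is hypothetical (the class and method names are illustrative, not the actual DolphinScheduler code): it shows how a guard could treat a pod that vanished after a clean process exit as normal completion instead of raising an error.

```java
import java.util.Optional;

// Hypothetical sketch, NOT the real AbstractCommandExecutor: a missing
// driver pod after a clean spark-submit exit is treated as completion
// rather than an error.
public class PodLogPollSketch {

    /**
     * Resolve the driver pod phase for one polling tick.
     *
     * @param podPhase         the pod's phase if the pod still exists, empty otherwise
     * @param processExitValue the exit code of the spark-submit process
     */
    public static String resolvePhase(Optional<String> podPhase, int processExitValue) {
        if (podPhase.isPresent()) {
            // Pod still visible: report its phase (Running, Succeeded, ...).
            return podPhase.get();
        }
        if (processExitValue == 0) {
            // Pod already cleaned up, but spark-submit exited 0: the
            // application finished; stop log collection quietly.
            return "Succeeded";
        }
        // Pod gone and the process failed: this is a genuine error.
        throw new RuntimeException("The driver pod does not exist.");
    }
}
```

The point of such a guard is that, once the driver terminates, the pod disappearing between two polling ticks is an expected race rather than a failure, which matches the log above where the error is logged even though the task is finalized as SUCCESS.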
