silent-night-no-trace opened a new issue, #17521:
URL: https://github.com/apache/dolphinscheduler/issues/17521

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   I use DolphinScheduler (ds) to schedule Spark tasks in spark-on-k8s mode. When the pod runs for too long, the task log on the ds task instance page shows the following error:
   
   
   ```log
   191892f6b9e44020aea1005406d009b2 (phase: Running)
   [INFO] 2025-09-12 14:20:54.320 +0800 -  -> 
        25/09/12 14:20:53 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
   [INFO] 2025-09-12 14:20:55.320 +0800 -  -> 
        25/09/12 14:20:54 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
   [INFO] 2025-09-12 14:20:56.321 +0800 -  -> 
        25/09/12 14:20:55 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
   [INFO] 2025-09-12 14:20:57.322 +0800 -  -> 
        25/09/12 14:20:56 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
   [INFO] 2025-09-12 14:20:58.323 +0800 -  -> 
        25/09/12 14:20:57 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
   [INFO] 2025-09-12 14:20:59.324 +0800 -  -> 
        25/09/12 14:20:58 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
        25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: State changed, new 
state: 
                 pod name: 
com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
                 namespace: ds
                 labels: dolphinscheduler-label -> 103282_91946, spark-app-name 
-> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector 
-> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version 
-> 3.5.5
                 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
                 creation time: 2025-09-12T06:10:40Z
                 service account name: spark-driver-sa
                 volumes: spark-history-logs-pvc, hadoop-properties, 
spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
                 node name: kubesphere-node-7
                 start time: 2025-09-12T06:10:40Z
                 phase: Running
                 container status: 
                         container name: spark-kubernetes-driver
                         container image: 
harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
                         container state: terminated
                         container started at: 2025-09-12T06:10:42Z
                         container finished at: 2025-09-12T06:20:58Z
                         exit code: 0
                         termination reason: Completed
   [INFO] 2025-09-12 14:21:00.325 +0800 -  -> 
        25/09/12 14:20:59 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
   [INFO] 2025-09-12 14:21:01.325 +0800 -  -> 
        25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new 
state: 
                 pod name: 
com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
                 namespace: ds
                 labels: dolphinscheduler-label -> 103282_91946, spark-app-name 
-> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector 
-> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version 
-> 3.5.5
                 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
                 creation time: 2025-09-12T06:10:40Z
                 service account name: spark-driver-sa
                 volumes: spark-history-logs-pvc, hadoop-properties, 
spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
                 node name: kubesphere-node-7
                 start time: 2025-09-12T06:10:40Z
                 phase: Running
                 container status: 
                         container name: spark-kubernetes-driver
                         container image: 
harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
                         container state: terminated
                         container started at: 2025-09-12T06:10:42Z
                         container finished at: 2025-09-12T06:20:58Z
                         exit code: 0
                         termination reason: Completed
        25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Running)
        25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: State changed, new 
state: 
                 pod name: 
com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
                 namespace: ds
                 labels: dolphinscheduler-label -> 103282_91946, spark-app-name 
-> com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob, spark-app-selector 
-> spark-191892f6b9e44020aea1005406d009b2, spark-role -> driver, spark-version 
-> 3.5.5
                 pod uid: 2c52087d-daf3-43c8-b57e-95841be1d52c
                 creation time: 2025-09-12T06:10:40Z
                 service account name: spark-driver-sa
                 volumes: spark-history-logs-pvc, hadoop-properties, 
spark-local-dir-1, spark-conf-volume-driver, kube-api-access-d5fwt
                 node name: kubesphere-node-7
                 start time: 2025-09-12T06:10:40Z
                 phase: Succeeded
                 container status: 
                         container name: spark-kubernetes-driver
                         container image: 
harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
                         container state: terminated
                         container started at: 2025-09-12T06:10:42Z
                         container finished at: 2025-09-12T06:20:58Z
                         exit code: 0
                         termination reason: Completed
        25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application status 
for spark-191892f6b9e44020aea1005406d009b2 (phase: Succeeded)
        25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Container final 
statuses:
        
        
                 container name: spark-kubernetes-driver
                 container image: 
harbor.jifenfu.net/apache/spark:3.5.5-scala2.12-java17-python3-ubuntu
                 container state: terminated
                 container started at: 2025-09-12T06:10:42Z
                 container finished at: 2025-09-12T06:20:58Z
                 exit code: 0
                 termination reason: Completed
        25/09/12 14:21:00 INFO LoggingPodStatusWatcherImpl: Application 
com.analysis.job.UserCleanupMergeBatchJob with application ID 
spark-191892f6b9e44020aea1005406d009b2 and submission ID 
ds:com-wn-cloud-cdp-analysis-job-usercleanupmergebatchjob-bfc75f993c8bc8b4-driver
 finished
        25/09/12 14:21:00 INFO ShutdownHookManager: Shutdown hook called
        25/09/12 14:21:00 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-1c12b6c0-f57a-4bf6-9321-336f6045ba05
   [ERROR] 2025-09-12 14:21:01.327 +0800 - Handle pod log error
   java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
java.lang.RuntimeException: The driver pod does not exist.
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at 
org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.run(AbstractCommandExecutor.java:182)
        at 
org.apache.dolphinscheduler.plugin.task.api.AbstractYarnTask.handle(AbstractYarnTask.java:53)
        at 
org.apache.dolphinscheduler.server.worker.runner.DefaultWorkerTaskExecutor.executeTask(DefaultWorkerTaskExecutor.java:51)
        at 
org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecutor.run(WorkerTaskExecutor.java:172)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The 
driver pod does not exist.
        at 
org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:254)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        ... 3 common frames omitted
   Caused by: java.lang.RuntimeException: The driver pod does not exist.
        at 
org.apache.dolphinscheduler.plugin.task.api.AbstractCommandExecutor.lambda$collectPodLogIfNeeded$0(AbstractCommandExecutor.java:244)
        ... 7 common frames omitted
   [INFO] 2025-09-12 14:21:01.328 +0800 - process has exited. execute 
path:/tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946,
 processId:49917 ,exitStatusCode:0 ,processWaitForStatus:true 
,processExitValue:0
   [INFO] 2025-09-12 14:21:01.328 +0800 - Start finding appId in 
/opt/dolphinscheduler/logs/20250912/151590881017889/8/103282/91946.log, fetch 
way: log 
   [INFO] 2025-09-12 14:21:01.330 +0800 - 
   
***********************************************************************************************
   [INFO] 2025-09-12 14:21:01.330 +0800 - *********************************  
Finalize task instance  ************************************
   [INFO] 2025-09-12 14:21:01.330 +0800 - 
***********************************************************************************************
   [INFO] 2025-09-12 14:21:01.331 +0800 - Upload output files: [] successfully
   [INFO] 2025-09-12 14:21:01.333 +0800 - Send task execute status: SUCCESS to 
master : dolphinscheduler-worker-1.dolphinscheduler-worker-headless:1234
   [INFO] 2025-09-12 14:21:01.333 +0800 - Remove the current task execute 
context from worker cache
   [INFO] 2025-09-12 14:21:01.334 +0800 - The current execute mode isn't 
develop mode, will clear the task execute file: 
/tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
   [INFO] 2025-09-12 14:21:01.347 +0800 - Success clear the task execute file: 
/tmp/dolphinscheduler/exec/process/root/147374930387104/151590881017889_8/103282/91946
   [INFO] 2025-09-12 14:21:01.347 +0800 - FINALIZE_SESSION
   ```
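   Judging from the stack trace, the error comes from the scheduled pod-log collector in `AbstractCommandExecutor.collectPodLogIfNeeded`: the Spark application finishes, Kubernetes cleans up the completed driver pod, and the collector's next poll no longer finds it and throws. A minimal Java sketch of that race, using hypothetical names (`collectStrict`/`collectTolerant` are mine, not DolphinScheduler code):

   ```java
   import java.util.Optional;

   // Hypothetical sketch of the race visible in the log above: a scheduled
   // collector polls the driver pod, Spark reports "finished", Kubernetes
   // removes the pod, and the next poll finds nothing and throws.
   class PodLogRace {

       // Stand-in for the Kubernetes pod lookup; empty means the pod is gone.
       static Optional<String> findDriverPod(boolean podStillExists) {
           return podStillExists ? Optional.of("driver-pod") : Optional.empty();
       }

       // Mirrors the failing behavior: a vanished pod is always an error.
       static String collectStrict(boolean podStillExists) {
           return findDriverPod(podStillExists)
                   .orElseThrow(() -> new RuntimeException("The driver pod does not exist."));
       }

       // A tolerant variant: once the task has already been observed to
       // succeed, a vanished pod is treated as normal cleanup, not a failure.
       static String collectTolerant(boolean podStillExists, boolean taskSucceeded) {
           Optional<String> pod = findDriverPod(podStillExists);
           if (pod.isPresent()) {
               return pod.get();
           }
           if (taskSucceeded) {
               return "pod already cleaned up; stop log collection";
           }
           throw new RuntimeException("The driver pod does not exist.");
       }
   }
   ```

   The tolerant variant reflects the behavior I would expect: a missing pod after a `Succeeded` phase is a normal end of log collection, not an error to surface on the task instance page.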
   
   
   ### What you expected to happen
   
   Every task instance on the ds task instance page should be able to pull its logs normally.
   
   ### How to reproduce
   
   Use ds to schedule a Spark task in spark-on-k8s mode and let the task run for more than 10 minutes.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.2.x
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

