[jira] [Updated] (SPARK-53145) Task rerun caused by executor decommission triggered by DRA

Zhen Wang (Jira) Wed, 06 Aug 2025 10:22:03 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-53145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhen Wang updated SPARK-53145:
------------------------------
    Description: 
This duplicates https://issues.apache.org/jira/browse/SPARK-49472: tasks are 
re-run many times. In my case, however, the decommission was triggered by 
dynamic resource allocation (DRA). 

!image-2025-08-06-21-44-18-059.png!

related configurations:
{code:java}
spark.decommission.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true{code}
Related logs:
{code:java}
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Notify executor 208 to decommission.
25/08/06 09:55:08 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(208, xxx, 22752, None)) as being decommissioning.
25/08/06 09:55:08 INFO ExecutorAllocationManager: Executors 208 removed due to idle timeout.
25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
25/08/06 09:55:09 INFO ExecutorMonitor: Executor 208 is removed. Remove reason statistics: (gracefully decommissioned: 203, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
{code}
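
To gauge how many tasks are re-run after a decommission, the "Resubmitted ShuffleMapTask" lines in a driver log can be tallied with a small script. The following is only a minimal sketch and not part of Spark: the function name resubmitted_by_stage is hypothetical, and it assumes the two numbers in ShuffleMapTask(a, b) are the stage ID and the partition ID.

```python
import re
from collections import defaultdict

# Driver log lines taken from the excerpt in this report (mail line wraps joined).
LOG = """\
25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
"""

# Assumption: the pair is (stage ID, partition ID).
RESUBMITTED = re.compile(r"Resubmitted ShuffleMapTask\((\d+), (\d+)\)")


def resubmitted_by_stage(log_text):
    """Map each stage ID to the sorted partition IDs of its resubmitted tasks."""
    by_stage = defaultdict(list)
    for line in log_text.splitlines():
        match = RESUBMITTED.search(line)
        if match:
            by_stage[int(match.group(1))].append(int(match.group(2)))
    return {stage: sorted(parts) for stage, parts in by_stage.items()}


print(resubmitted_by_stage(LOG))  # prints {4: [3, 99, 155, 170]}
```

Running the same function over a full driver log would show whether the resubmissions cluster around particular decommissioned executors or stages.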
 

 

  was:
This duplicates https://issues.apache.org/jira/browse/SPARK-49472: tasks are 
re-run many times. In my case, however, the decommission was triggered by 
dynamic resource allocation (DRA). 

!task.png!

related configurations:
{code:java}
spark.decommission.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true{code}
Related logs:
{code:java}
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Notify executor 208 to decommission.
25/08/06 09:55:08 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(208, xxx, 22752, None)) as being decommissioning.
25/08/06 09:55:08 INFO ExecutorAllocationManager: Executors 208 removed due to idle timeout.
25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
25/08/06 09:55:09 INFO ExecutorMonitor: Executor 208 is removed. Remove reason statistics: (gracefully decommissioned: 203, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
{code}
 

 


> Task rerun caused by executor decommission triggered by DRA
> -----------------------------------------------------------
>
>                 Key: SPARK-53145
>                 URL: https://issues.apache.org/jira/browse/SPARK-53145
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Zhen Wang
>            Priority: Major
>         Attachments: image-2025-08-06-21-44-18-059.png
>
>
> This duplicates https://issues.apache.org/jira/browse/SPARK-49472: tasks are 
> re-run many times. In my case, however, the decommission was triggered by 
> dynamic resource allocation (DRA). 
> !image-2025-08-06-21-44-18-059.png!
> related configurations:
> {code:java}
> spark.decommission.enabled=true
> spark.shuffle.service.enabled=true
> spark.dynamicAllocation.enabled=true{code}
> Related logs:
> {code:java}
> 25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
> 25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Notify executor 208 to decommission.
> 25/08/06 09:55:08 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(208, xxx, 22752, None)) as being decommissioning.
> 25/08/06 09:55:08 INFO ExecutorAllocationManager: Executors 208 removed due to idle timeout.
> 25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
> 25/08/06 09:55:09 INFO ExecutorMonitor: Executor 208 is removed. Remove reason statistics: (gracefully decommissioned: 203, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

