[ 
https://issues.apache.org/jira/browse/SPARK-53145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-53145:
-----------------------------------
    Labels: pull-request-available  (was: )

> Task rerun caused by executor decommission triggered by DRA
> -----------------------------------------------------------
>
>                 Key: SPARK-53145
>                 URL: https://issues.apache.org/jira/browse/SPARK-53145
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Zhen Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-08-06-21-45-19-750.png
>
>
> Same symptom as https://issues.apache.org/jira/browse/SPARK-49472: tasks are 
> rerun many times. In my case, however, the decommission was triggered by 
> dynamic resource allocation (DRA). 
> !image-2025-08-06-21-45-19-750.png!
> Related configurations:
> {code:java}
> spark.decommission.enabled=true
> spark.shuffle.service.enabled=true
> spark.dynamicAllocation.enabled=true{code}
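
For anyone trying to reproduce this, the three settings above can be passed to spark-submit as `--conf key=value` pairs. The following is only a minimal sketch: the helper name `to_submit_args` and the `app.jar` placeholder are hypothetical, not part of the report, and the reporter's job may have used further settings that are not shown here.

```python
import shlex

# Settings copied verbatim from the report above.
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.enabled": "true",
}


def to_submit_args(conf):
    """Render a config mapping as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # "app.jar" is a hypothetical placeholder for the application under test.
    command = ["spark-submit", *to_submit_args(DECOMMISSION_CONF), "app.jar"]
    print(shlex.join(command))
```
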
> Related logs:
> {code:java}
> 25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
> 25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Notify executor 208 to decommission.
> 25/08/06 09:55:08 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(208, xxx, 22752, None)) as being decommissioning.
> 25/08/06 09:55:08 INFO ExecutorAllocationManager: Executors 208 removed due to idle timeout.
> 25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
> 25/08/06 09:55:09 INFO ExecutorMonitor: Executor 208 is removed. Remove reason statistics: (gracefully decommissioned: 203, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0). {code}
>  
>  
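
To gauge how many tasks were rerun in a longer driver log, the "Resubmitted ShuffleMapTask" lines quoted above can be counted per stage. The sketch below is only an illustration: it assumes each log entry sits on a single line (not wrapped as in this email) and that the two numbers in `ShuffleMapTask(a, b)` are the stage ID and partition ID. The function names are hypothetical, and the sample lines are taken from the logs in the report.

```python
import re
from collections import Counter

# Matches the DAGScheduler resubmission message seen in the report.
# Assumption: the two numbers are (stage ID, partition ID).
RESUBMITTED = re.compile(
    r"DAGScheduler: Resubmitted ShuffleMapTask\((\d+), (\d+)\)"
)


def resubmitted_tasks(lines):
    """Return (stage_id, partition_id) pairs for every resubmitted task."""
    tasks = []
    for line in lines:
        match = RESUBMITTED.search(line)
        if match:
            tasks.append((int(match.group(1)), int(match.group(2))))
    return tasks


def resubmissions_per_stage(lines):
    """Count resubmitted tasks per stage ID."""
    return Counter(stage for stage, _ in resubmitted_tasks(lines))


# Sample lines copied from the logs in the report (unwrapped).
SAMPLE = [
    "25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.",
    "25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.",
    "25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.",
    "25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.",
    "25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)",
]

if __name__ == "__main__":
    for stage, count in sorted(resubmissions_per_stage(SAMPLE).items()):
        print(f"stage {stage}: {count} resubmitted task(s)")
```

Run against a full driver log (for example by passing an open file object instead of `SAMPLE`), this shows which stages lost the most map output when idle executors were decommissioned.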



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

