[
https://issues.apache.org/jira/browse/SPARK-53145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-53145:
-----------------------------------
Labels: pull-request-available (was: )
> Task rerun caused by executor decommission triggered by DRA
> -----------------------------------------------------------
>
> Key: SPARK-53145
> URL: https://issues.apache.org/jira/browse/SPARK-53145
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Zhen Wang
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2025-08-06-21-45-19-750.png
>
>
> Similar to https://issues.apache.org/jira/browse/SPARK-49472: tasks are rerun
> many times. In my case, however, the decommission was triggered by dynamic
> resource allocation (DRA).
> !image-2025-08-06-21-45-19-750.png!
> Related configurations:
> {code:java}
> spark.decommission.enabled=true
> spark.shuffle.service.enabled=true
> spark.dynamicAllocation.enabled=true{code}
> Related logs:
> {code:java}
> 25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
> 25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Notify executor 208 to decommission.
> 25/08/06 09:55:08 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(208, xxx, 22752, None)) as being decommissioning.
> 25/08/06 09:55:08 INFO ExecutorAllocationManager: Executors 208 removed due to idle timeout.
> 25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
> 25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
> 25/08/06 09:55:09 INFO ExecutorMonitor: Executor 208 is removed. Remove reason statistics: (gracefully decommissioned: 203, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0). {code}
>
>
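For anyone checking whether their own driver log shows the same pattern, a minimal sketch follows. It counts the "Resubmitted ShuffleMapTask" lines quoted above, grouped by the first number in the parentheses. It is not part of the report or of Spark: the helper name count_resubmitted is hypothetical, and reading the two numbers as (stage id, partition id) is an assumption.

```python
import re
from collections import Counter

# Matches the DAGScheduler line quoted in the issue. Treating the two numbers
# as (stage id, partition id) is an assumption, not something the report states.
RESUBMIT_RE = re.compile(
    r"DAGScheduler: Resubmitted ShuffleMapTask\((\d+), (\d+)\)"
)


def count_resubmitted(log_text):
    """Return a Counter mapping stage id -> number of resubmitted ShuffleMapTasks."""
    # Mail clients may wrap a log record across lines, so collapse all
    # whitespace runs into single spaces before matching.
    flat = " ".join(log_text.split())
    counts = Counter()
    for stage, _partition in RESUBMIT_RE.findall(flat):
        counts[int(stage)] += 1
    return counts


# Excerpt taken from the log in the issue description.
sample = """
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
"""

if __name__ == "__main__":
    for stage, n in sorted(count_resubmitted(sample).items()):
        print(f"stage {stage}: {n} resubmitted ShuffleMapTask(s)")
```

A nonzero count immediately after an "is decommissioned" line for the same executor would be consistent with the behavior described in this issue, though the sketch itself does not correlate the two.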
--
This message was sent by Atlassian Jira
(v8.20.10#820010)