[
https://issues.apache.org/jira/browse/SPARK-53145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhen Wang updated SPARK-53145:
------------------------------
Description:
Duplicate of https://issues.apache.org/jira/browse/SPARK-49472: tasks are
re-run many times. In my case, however, the decommission was triggered by DRA
(dynamic resource allocation).
!image-2025-08-06-21-44-18-059.png!
Related configurations:
{code:java}
spark.decommission.enabled=true
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true{code}
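For anyone trying to reproduce this, the three settings above can be passed to spark-submit as `--conf` flags. Below is a minimal Python sketch that renders them. The helper name `to_submit_args` is made up for illustration, and the application jar and its arguments are deliberately left out.

```python
# Settings copied from the report; everything else here is illustrative.
REPORTED_CONF = {
    "spark.decommission.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.enabled": "true",
}


def to_submit_args(conf):
    """Render a config mapping as repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Prints the spark-submit prefix; append the application jar/script yourself.
    print(" ".join(["spark-submit", *to_submit_args(REPORTED_CONF)]))
```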
Related logs:
{code:java}
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Notify executor 208 to decommission.
25/08/06 09:55:08 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(208, xxx, 22752, None)) as being decommissioning.
25/08/06 09:55:08 INFO ExecutorAllocationManager: Executors 208 removed due to idle timeout.
25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
25/08/06 09:55:09 INFO ExecutorMonitor: Executor 208 is removed. Remove reason statistics: (gracefully decommissioned: 203, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0). {code}
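To gauge how many map tasks are resubmitted after each decommission, a driver log like the one above can be scanned for the two relevant message types. The following is a rough Python sketch, not part of Spark. It assumes each log record sits on a single line (mail clients may wrap them), that the first number in `ShuffleMapTask(a, b)` is the stage id and the second the partition id, and the function name `summarize` is hypothetical.

```python
import re
from collections import defaultdict

# A few records taken from the report, one per line.
SAMPLE_LOG = """\
25/08/06 09:55:08 INFO YarnClusterSchedulerBackend: Decommission executors: 208
25/08/06 09:55:09 INFO YarnClusterScheduler: Executor 208 on xxx is decommissioned after 1.0 s.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 99), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 3), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 170), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Resubmitted ShuffleMapTask(4, 155), so marking it as still running.
25/08/06 09:55:09 INFO DAGScheduler: Executor lost: 208 (epoch 3)
"""

RESUBMITTED = re.compile(r"Resubmitted ShuffleMapTask\((\d+), (\d+)\)")
DECOMMISSIONED = re.compile(r"Executor (\d+) on \S+ is decommissioned")


def summarize(log_text):
    """Return (decommissioned executor ids, {stage id: [resubmitted partitions]})."""
    executors = []
    resubmitted = defaultdict(list)
    for line in log_text.splitlines():
        match = RESUBMITTED.search(line)
        if match:
            resubmitted[int(match.group(1))].append(int(match.group(2)))
            continue
        match = DECOMMISSIONED.search(line)
        if match:
            executors.append(int(match.group(1)))
    return executors, dict(resubmitted)


if __name__ == "__main__":
    executors, resubmitted = summarize(SAMPLE_LOG)
    print("decommissioned executors:", executors)
    for stage, partitions in resubmitted.items():
        print(f"stage {stage}: {len(partitions)} resubmitted partitions {partitions}")
```

Running the same function over a full driver log would show whether resubmissions cluster around decommission events for idle executors.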
> Task rerun caused by executor decommission triggered by DRA
> -----------------------------------------------------------
>
> Key: SPARK-53145
> URL: https://issues.apache.org/jira/browse/SPARK-53145
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Zhen Wang
> Priority: Major
> Attachments: image-2025-08-06-21-44-18-059.png
>