warrenzhu25 opened a new pull request, #41905:
URL: https://github.com/apache/spark/pull/41905
### What changes were proposed in this pull request?
Do not increase the shuffle migration failure count when the target executor is
decommissioned.
### Why are the changes needed?
The block manager decommissioner only periodically syncs with the block manager
master for the list of live peers. If a peer block manager is decommissioned
between two syncs, the decommissioner still tries to migrate shuffle blocks to
that decommissioned block manager. The migration then fails with
`RuntimeException("BlockSavedOnDecommissionedBlockManagerException")`. Detailed
stack trace below:
```
org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
	at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5(BlockManagerDecommissioner.scala:127)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5$adapted(BlockManagerDecommissioner.scala:118)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:118)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block shuffle_2_6429_0.data cannot be saved on decommissioned executor
	at org.apache.spark.errors.SparkCoreErrors$.cannotSaveBlockOnDecommissionedExecutorError(SparkCoreErrors.scala:238)
	at org.apache.spark.storage.BlockManager.checkShouldStore(BlockManager.scala:277)
	at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:741)
	at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:174)
```
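The proposed behavior can be illustrated with a minimal, self-contained sketch. This is not Spark's actual code: the class, method, and counter names below are illustrative assumptions that only model the idea of skipping the failure counter when the migration target has been decommissioned.

```java
// Minimal sketch (assumed names, not Spark's real API): a shuffle-migration
// failure only increments the failure counter when the target executor is
// still alive; a failure caused by a decommissioned target is ignored so the
// block can simply be retried against a live peer.
public class MigrationSketch {
    // Stand-in for BlockSavedOnDecommissionedBlockManagerException.
    static class DecommissionedTargetException extends RuntimeException {
        DecommissionedTargetException(String msg) { super(msg); }
    }

    static int failureCount = 0;

    static void migrateTo(boolean targetDecommissioned) {
        try {
            if (targetDecommissioned) {
                throw new DecommissionedTargetException(
                    "Block cannot be saved on decommissioned executor");
            }
            // Upload succeeded; nothing to count.
        } catch (DecommissionedTargetException e) {
            // Target went away between peer-list syncs: not a real failure,
            // so do NOT increment the counter.
        } catch (RuntimeException e) {
            failureCount++; // genuine failure: counts toward the retry limit
        }
    }

    public static void main(String[] args) {
        migrateTo(true);
        System.out.println(failureCount);
    }
}
```

With this distinction, a burst of migrations against a peer that was decommissioned between syncs no longer exhausts the failure budget for blocks that would succeed on a live peer.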
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT in `BlockManagerDecommissionUnitSuite`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]