warrenzhu25 opened a new pull request, #41905:
URL: https://github.com/apache/spark/pull/41905

   ### What changes were proposed in this pull request?
Do not increase the shuffle migration failure count when the target executor is decommissioned.
   
   
   ### Why are the changes needed?
The block manager decommissioner only syncs with the block manager master about live peers at a fixed interval. If a peer block manager is decommissioned between two syncs, the decommissioner still tries to migrate shuffle blocks to it. The migration then fails with a `RuntimeException` wrapping `BlockSavedOnDecommissionedBlockManagerException`. Detailed stack trace below:
   
   ```
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
    at org.apache.spark.network.BlockTransferService.uploadBlockSync(BlockTransferService.scala:122)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5(BlockManagerDecommissioner.scala:127)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$5$adapted(BlockManagerDecommissioner.scala:118)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:118)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: org.apache.spark.storage.BlockSavedOnDecommissionedBlockManagerException: Block shuffle_2_6429_0.data cannot be saved on decommissioned executor
    at org.apache.spark.errors.SparkCoreErrors$.cannotSaveBlockOnDecommissionedExecutorError(SparkCoreErrors.scala:238)
    at org.apache.spark.storage.BlockManager.checkShouldStore(BlockManager.scala:277)
    at org.apache.spark.storage.BlockManager.putBlockDataAsStream(BlockManager.scala:741)
    at org.apache.spark.network.netty.NettyBlockRpcServer.receiveStream(NettyBlockRpcServer.scala:174)
   
   ```
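   Conceptually, the change can be sketched as follows. This is a hedged illustration, not the actual patch: identifiers such as `peer`, `failures`, and the exact `uploadBlockSync` arguments are assumptions. The idea is that, in the shuffle migration loop, an upload error whose cause is `BlockSavedOnDecommissionedBlockManagerException` on the target side is treated as benign and skipped, rather than incrementing the per-block failure count checked against `spark.storage.decommission.maxReplicationFailuresPerBlock`.

   ```scala
   // Sketch only; identifiers are illustrative and do not match the real diff.
   try {
     bm.blockTransferService.uploadBlockSync(
       peer.host, peer.port, peer.executorId, blockId, buffer,
       StorageLevel.DISK_ONLY, null)
   } catch {
     // The target block manager was decommissioned between two peer refreshes:
     // skip it without counting this as a migration failure.
     case e: Exception if Option(e.getCause).exists(
         _.getMessage.contains("BlockSavedOnDecommissionedBlockManagerException")) =>
       logInfo(s"Peer $peer is decommissioned; not counting a migration failure")
     case e: Exception =>
       failures += 1  // only genuine failures count toward the retry limit
   }
   ```

   This keeps transient races between peer refreshes from exhausting the retry budget, while real transfer failures still count as before.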
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added UT in `BlockManagerDecommissionUnitSuite`
   

