attilapiros opened a new pull request, #47779: URL: https://github.com/apache/spark/pull/47779
### What changes were proposed in this pull request? Credit to [maheshk114](https://github.com/maheshk114) for the initial investigation and the fix. This PR fix a bug where the shuffle service's ID is kept among the block location list at the removing of a RDD block from a node. Before this change `StorageLevel.NONE` is used to notify about the block remove which causes the block manager master ignoring the updating of the locations of for shuffle service's IDs (for details please see the method `BlockManagerMasterEndpoint#updateBlockInfo()` and keep in mind `StorageLevel.NONE.useDisk` is `false`). But after this change for notifying block remove only the replication count is set to 0 so `StorageLevel#isValid` is still false but `storageLevel.useDisk` is `true` so the the shuffle service's ID will be removed from the block location list. ### Why are the changes needed? If the block location is not updated properly, then tasks fails with fetch failed exception. The tasks will try to read the RDD blocks from a node using external shuffle service. The read will fail, if the node is already decommissioned. ``` WARN BlockManager [Executor task launch worker for task 25.0 in stage 6.0 (TID 1567)]: Failed to fetch remote block rdd_5_25 from BlockManagerId(4, vm-92303839, 7337, None) (failed attempt 1) org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301) at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:103) at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1155) at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:1099) at scala.Option.orElse(Option.scala:447) at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:1099) at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:1045) at org.apache.spark.storage.BlockManager.get(BlockManager.scala:1264) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1326) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new UT. ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
