dongjoon-hyun commented on a change in pull request #30492:
URL: https://github.com/apache/spark/pull/30492#discussion_r532122517
##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -627,7 +627,16 @@ private[spark] class BlockManager(
override def getLocalBlockData(blockId: BlockId): ManagedBuffer = {
if (blockId.isShuffle) {
logDebug(s"Getting local shuffle block ${blockId}")
- shuffleManager.shuffleBlockResolver.getBlockData(blockId)
+ try {
+ shuffleManager.shuffleBlockResolver.getBlockData(blockId)
+ } catch {
+ case e: IOException =>
+ if (conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) {
+ FallbackStorage.read(conf, blockId)
+ } else {
+ throw e
+ }
Review comment:
It's already `IOException`, and `FallbackStorage.read` throws
`IOException` for nonexistent files, which is fine and legitimate in the
`WorkerDecommission` context, @viirya. Please note the following.
- The whole Worker decommission (including `shuffle and rdd storage
decommission`) is designed as a best-effort approach.
- That's because the main use case is K8s graceful shutdown with the default
period (`30s`). We can increase the period, but we cannot technically set it
to an infinite value, because the executor would then hang in a disk-full
situation.
- What we are aiming for is to rescue as much data as possible, but `100%` is
not always guaranteed.
- Lastly, data block selection (shuffle or rdd) has been random-order from the
beginning. It gets worse when there are multiple shuffles across multiple
executors.
For the above reasons, we introduced `SPARK-33387 Support ordered shuffle
block migration` to rescue shuffles as completely as possible. However, we can
still easily imagine multiple shuffles coexisting on an executor in a skewed
manner. While the other executors succeed in completing the migration, the
skewed executor may fail to complete it within the given grace period.
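To make the control flow concrete, here is a minimal, Spark-free sketch of the read-with-fallback behavior discussed above. The names `FallbackReadSketch`, `Reader`, and `readBlock` are hypothetical stand-ins for illustration only (they are not Spark APIs); `local` plays the role of the shuffle block resolver and `fallback` plays the role of `FallbackStorage.read` when a fallback path is configured.

```scala
import java.io.IOException

object FallbackReadSketch {
  // Hypothetical stand-in: a function that returns the bytes of a block
  // or throws IOException when the block cannot be found.
  type Reader = String => Array[Byte]

  def readBlock(
      blockId: String,
      local: Reader,
      fallback: Option[Reader]): Array[Byte] = {
    try {
      local(blockId)
    } catch {
      case e: IOException =>
        fallback match {
          // Best effort: the fallback read may itself throw IOException
          // for a block that was never migrated; that exception propagates.
          case Some(read) => read(blockId)
          // No fallback configured: rethrow the original failure.
          case None => throw e
        }
    }
  }
}
```

In this sketch a block missing from both places still surfaces as an `IOException` to the caller, which matches the best-effort framing: the fallback rescues what was migrated in time and does not hide what was not.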