turboFei edited a comment on issue #25941: [WIP][SPARK-29257][Core][Shuffle] Use task attempt number as noop reduce id to handle disk failures during shuffle URL: https://github.com/apache/spark/pull/25941#issuecomment-535443610 About this issue, here is my proposal. The shuffle index file is kept per executor to store partition lengths, and when the external shuffle service (ESS) is enabled, it is read by the ESS. I think we can define a parameter, such as `spark.shuffle.index.maxRetry` (placeholder name), to control the maximum number of retries when picking a noop reduce id, so that we avoid landing on the same bad disk every time. Its default value can be the number of localDirs. We can define an inner class in IndexShuffleResolver and use an AtomicInteger to represent the value of the noop reduce id. When we fail to get the index file, we incrementAndGet a new noop reduce id, which must not exceed `spark.shuffle.index.maxRetry`. So the index file name is `shuffleId-mapId-retriedNoopReduceId`. And when reading data from the index file, we should first try to read the index file named `shuffleId-mapId-0`; if that fails, fall back to `shuffleId-mapId-1`, and so on, up to `shuffleId-mapId-${maxRetry-1}`.
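The scheme above could be sketched roughly as follows. This is a minimal illustration in plain Java, not Spark's actual code: the class name `NoopReduceIdTracker`, the `maxRetry` field, and the `shuffleId-mapId-N` file-name pattern are assumptions taken from the proposal, not from the Spark codebase.

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed retry scheme. The bound would come
// from the proposed spark.shuffle.index.maxRetry setting (default: the
// number of localDirs), not from a hard-coded value.
class NoopReduceIdTracker {
    private final AtomicInteger noopReduceId = new AtomicInteger(0);
    private final int maxRetry;

    NoopReduceIdTracker(int maxRetry) {
        this.maxRetry = maxRetry;
    }

    // Writer side: called when writing the index file fails (e.g. bad
    // disk). Advances to the next candidate id, bounded by maxRetry.
    int nextNoopReduceId() {
        int next = noopReduceId.incrementAndGet();
        if (next >= maxRetry) {
            throw new IllegalStateException(
                "exceeded spark.shuffle.index.maxRetry (" + maxRetry + ")");
        }
        return next;
    }

    // Reader side: try shuffleId-mapId-0 first, then fall back to
    // shuffleId-mapId-1, ..., shuffleId-mapId-(maxRetry-1).
    static File findIndexFile(File dir, int shuffleId, int mapId, int maxRetry) {
        for (int retry = 0; retry < maxRetry; retry++) {
            File candidate = new File(dir, shuffleId + "-" + mapId + "-" + retry);
            if (candidate.exists()) {
                return candidate;
            }
        }
        return null; // no index file found on any retry suffix
    }
}
```

The AtomicInteger keeps the retry counter consistent across concurrent writers in the same executor, while the reader needs no shared state at all: it simply probes each suffix in order.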
