turboFei edited a comment on issue #25941: [WIP][SPARK-29257][Core][Shuffle] 
Use task attempt number as noop reduce id to handle disk failures during shuffle
URL: https://github.com/apache/spark/pull/25941#issuecomment-535443610
 
 
   Regarding this issue, here is my proposal.
   
   The shuffle index file is kept per executor to store the partition lengths.
   
   When the external shuffle service (ESS) is enabled, it is read by the ESS.
   
   I think we can define a parameter, e.g. `spark.shuffle.index.maxRetry` (the name is just for illustration), to control the maximum number of retries to get a new NoOpReduceId, so that we avoid landing on the same bad disk every time.
   
   Its default value can be the number of local dirs.
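To make the lookup concrete, here is a minimal sketch of resolving that proposed config value with the suggested default. The parameter name and the helper itself are hypothetical, not existing Spark APIs:

```java
import java.util.HashMap;
import java.util.Map;

class ShuffleIndexConf {
    // Hypothetical helper: read the proposed spark.shuffle.index.maxRetry
    // setting, defaulting to the number of configured local dirs.
    static int maxRetry(Map<String, String> conf, String[] localDirs) {
        String v = conf.get("spark.shuffle.index.maxRetry");
        return v != null ? Integer.parseInt(v) : localDirs.length;
    }
}
```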
   
   
   
   We can define an inner class in `IndexShuffleBlockResolver` and use an `AtomicInteger` to represent the value of the NoOpReduceId.
   
   
   
   When we fail to get the index file, we `incrementAndGet` a new NoOpReduceId, which cannot exceed `spark.shuffle.index.maxRetry`.
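A rough sketch of what that counter could look like, assuming Spark's usual `shuffle_<shuffleId>_<mapId>_<reduceId>.index` naming; the class and method names are illustrative, not an existing implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

class NoOpReduceIdTracker {
    private final int maxRetry;
    // Current NoOpReduceId, bumped each time a write to the index file fails.
    private final AtomicInteger noOpReduceId = new AtomicInteger(0);

    NoOpReduceIdTracker(int maxRetry) {
        this.maxRetry = maxRetry;
    }

    int current() {
        return noOpReduceId.get();
    }

    // Called when getting the index file fails; returns the next id,
    // or -1 once maxRetry attempts have been exhausted.
    int nextId() {
        int next = noOpReduceId.incrementAndGet();
        return next < maxRetry ? next : -1;
    }

    // Index file name embedding the current NoOpReduceId as the reduce id.
    String indexFileName(int shuffleId, long mapId) {
        return "shuffle_" + shuffleId + "_" + mapId + "_" + current() + ".index";
    }
}
```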
   
   So the name of the index file is `shuffleId-mapId-retriedNoOpReduceId`.
   
   
   
   And when reading data from the index file, we should first try to read the index file named `shuffleId-mapId-0`; if that fails, fall back to `shuffleId-mapId-1`, and so on until `shuffleId-mapId-${maxRetry-1}`.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
