turboFei commented on issue #25941: [WIP][SPARK-29257][Core][Shuffle] Use task 
attempt number as noop reduce id to handle disk failures during shuffle
URL: https://github.com/apache/spark/pull/25941#issuecomment-535443610
 
 
   About this issue, here is my proposal.
   
   The shuffle index file is kept per executor to store the partition lengths.
   
   When the external shuffle service (ESS) and dynamic allocation are enabled, it will be read by the ESS.
   
   I think we can define a parameter, such as `spark.shuffle.index.maxRetry` (the name is just for illustration), to control the maximum number of retries to get a new NoOpReduceId, so that we avoid being located on the same bad disk every time.
   
   Its default value can be the number of localDirs.
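   The default described above could be resolved as in this minimal sketch; the helper class and method names are hypothetical, not Spark API:

```java
// Hypothetical helper: resolve the proposed spark.shuffle.index.maxRetry value.
// When the user has not configured it, default to the number of local dirs.
class MaxRetryConf {
    static int maxRetry(Integer configured, int numLocalDirs) {
        // configured == null models an unset config entry
        return configured != null ? configured : numLocalDirs;
    }
}
```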
   
   
   
   We can define an inner class in IndexShuffleBlockResolver and use an AtomicInteger to hold the current value of the NoOpReduceId.
   
   
   
   When we fail to get the index file, we incrementAndGet a new NoOpReduceId, which must not exceed `spark.shuffle.index.maxRetry`.
   
   So the index file name is shuffleId-mapId-retriedNoOpReduceId.
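   The counter and naming scheme above could be sketched like this; the class name, the `-1`-when-exhausted convention, and the exact file-name pattern are assumptions for illustration, not the actual Spark implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-resolver counter for the retried NoOpReduceId.
class NoopReduceIdCounter {
    private final AtomicInteger current = new AtomicInteger(0);
    private final int maxRetry; // e.g. defaults to the number of local dirs

    NoopReduceIdCounter(int maxRetry) {
        this.maxRetry = maxRetry;
    }

    /** Advance to the next NoOpReduceId after a disk failure; the id must
     *  stay below maxRetry. Returns -1 once the retries are exhausted. */
    int nextNoopReduceId() {
        int next = current.incrementAndGet();
        return next < maxRetry ? next : -1;
    }

    int currentNoopReduceId() {
        return current.get();
    }

    /** Index file name of the form shuffle_<shuffleId>_<mapId>_<noopReduceId>.index */
    static String indexFileName(int shuffleId, long mapId, int noopReduceId) {
        return "shuffle_" + shuffleId + "_" + mapId + "_" + noopReduceId + ".index";
    }
}
```

   Using an AtomicInteger keeps the id monotonic even when several map tasks on the executor hit the bad disk concurrently.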
   
   
   
   And when reading data from the index file, we should first try to read the index file named `shuffleId-mapId-0`; if that fails, fall back to `shuffleId-mapId-1`, and so on, up to `shuffleId-mapId-${maxRetry-1}`.
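   The reader-side fallback described above amounts to probing each candidate id in order; a minimal sketch, assuming the hypothetical file-name pattern `shuffle_<shuffleId>_<mapId>_<noopReduceId>.index`:

```java
import java.io.File;

// Illustrative reader-side lookup: try noopReduceId 0 .. maxRetry-1
// and return the first index file that exists on disk.
class IndexFileLookup {
    static File findIndexFile(File dir, int shuffleId, long mapId, int maxRetry) {
        for (int noopReduceId = 0; noopReduceId < maxRetry; noopReduceId++) {
            File candidate = new File(dir,
                "shuffle_" + shuffleId + "_" + mapId + "_" + noopReduceId + ".index");
            if (candidate.isFile()) {
                return candidate; // first existing candidate wins
            }
        }
        return null; // no index file found under any retried NoOpReduceId
    }
}
```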
   
   
