yaooqinn commented on issue #25941: [WIP][SPARK-29257][Core][Shuffle] Use task 
attempt number as noop reduce id to handle disk failures during shuffle
URL: https://github.com/apache/spark/pull/25941#issuecomment-535545941
 
 
   We will not create duplicate shuffle files at all. The unsuccessful attempts 
may have a chance to create or pick a subdirectory under "blockmgr-xxx" before 
a disk failure occurs on it, but they cannot commit the index and data files 
because of that failure. The 
“shuffle_$shuffleId_$mapId_$attemptNumber_0.index” and 
“shuffle_$shuffleId_$mapId_$attemptNumber_0.data” files may be located on 
different disks, and if either of them lands on the bad disk, the write 
process is aborted and the task attempt fails. Only the successful task can 
create those files.
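   To make the "different disks" point concrete, here is a minimal sketch of how a block file can land on any configured local dir. The hash function below (md5) is an illustrative stand-in, not Spark's actual hash, and the paths and IDs are hypothetical; the real logic lives in Spark's DiskBlockManager, which hashes the block file name to pick a directory, so the .index and .data files of the same map output can end up on different disks:

```python
import hashlib

def pick_local_dir(block_name, local_dirs):
    # Hash the block file name and pick one of the configured local dirs.
    # md5 is only a stand-in for Spark's internal hash; the point is that
    # the choice depends on the file name, so .index and .data may differ.
    h = int(hashlib.md5(block_name.encode()).hexdigest(), 16)
    return local_dirs[h % len(local_dirs)]

# Hypothetical IDs and dirs, for illustration only.
shuffle_id, map_id, attempt = 0, 3, 1
local_dirs = ["/disk1/blockmgr-xxx", "/disk2/blockmgr-xxx"]
index_file = f"shuffle_{shuffle_id}_{map_id}_{attempt}_0.index"
data_file = f"shuffle_{shuffle_id}_{map_id}_{attempt}_0.data"
print(pick_local_dir(index_file, local_dirs))
print(pick_local_dir(data_file, local_dirs))
```

   If either file hashes to a directory on the failed disk, the commit fails and so does the task attempt, which is why no committed duplicates survive.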
   
   In another case, if a disk failure happens during the shuffle read phase, 
it may cause a fetch-failed exception and re-run the dependent map tasks, but 
I guess our processing logic for those duplicated files will not change.
   
   
