yaooqinn commented on issue #25962: [SPARK-29285][Shuffle] Temporary shuffle 
files should be able to handle disk failures
URL: https://github.com/apache/spark/pull/25962#issuecomment-547224500
 
 
   > I see what you're trying to do here, but does this really buy you much? If 
you have one bad disk, even if you can prevent temporary files from going to 
that disk, the final destination files still have a really high chance of going 
to that disk, don't they?
   
   @squito Yes, since the temp file name is random, each attempt can still 
land on that bad disk. But with up to 10 retries, the probability that every 
attempt lands there becomes very low. And it is worth saving a task from 
failing and being rescheduled after it has finished all of its computation, 
right before the commit step, especially when the task is heavy or skewed.
   
   In our 2000-node Hadoop cluster, with 12 disks per node, this approach 
has reduced the number of occurrences of that exception a lot. 
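
As a rough back-of-the-envelope check of the retry argument, the sketch below computes the chance that every attempt lands on a bad disk. It assumes each attempt picks one of the node's disks uniformly and independently, that exactly one of the 12 disks is bad, and that "10 retries" means 10 attempts in total; the function name and parameters are hypothetical and are not part of Spark. It covers only the temporary file, not the final destination file raised in the quoted question.

```python
def all_attempts_hit_bad_disk(num_disks: int, bad_disks: int, max_attempts: int) -> float:
    """Probability that every attempt lands on a bad disk.

    Assumes each attempt chooses a disk uniformly at random and
    independently of earlier attempts (an illustrative simplification).
    """
    if num_disks <= 0 or max_attempts <= 0:
        raise ValueError("num_disks and max_attempts must be positive")
    if not 0 <= bad_disks <= num_disks:
        raise ValueError("bad_disks must be between 0 and num_disks")
    return (bad_disks / num_disks) ** max_attempts


if __name__ == "__main__":
    # 12 disks per node comes from the comment; one bad disk is an assumption.
    single = all_attempts_hit_bad_disk(num_disks=12, bad_disks=1, max_attempts=1)
    retried = all_attempts_hit_bad_disk(num_disks=12, bad_disks=1, max_attempts=10)
    print(f"one attempt:  {single:.4f}")
    print(f"ten attempts: {retried:.2e}")
```

Under these assumptions a single attempt fails about 8.3% of the time, while ten attempts all failing has a probability on the order of 1e-11. Real directory selection may not be uniform or independent, so the numbers are only indicative.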
