yaooqinn commented on issue #25962: [SPARK-29285][Shuffle] Temporary shuffle files should be able to handle disk failures
URL: https://github.com/apache/spark/pull/25962#issuecomment-547224500

> I see what you're trying to do here, but does this really buy you much? If you have one bad disk, even if you can prevent temporary files from going to that disk, the final destination files still have a really high chance of going to that disk, don't they?

@squito Yes, since the temp file name is random, a retry can still land on the bad disk. But with a maximum of 10 retries, the probability of that happening every time becomes very low. It is also worth preventing a task from failing and being rescheduled after it has finished all of its computation, right before the commit step, especially when the task is heavy or skewed. In our 2000-node Hadoop cluster, with 12 disks per node, this approach reduced the number of occurrences of that exception a lot.
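
To put a rough number on the retry argument, here is a back-of-the-envelope sketch. It assumes each attempt picks one of the disks uniformly and independently, and that exactly one of the 12 disks is bad; neither assumption is taken from the patch itself, and the function name is made up for illustration.

```python
def all_attempts_hit_bad_disk(num_disks: int, bad_disks: int, max_attempts: int) -> float:
    """Probability that every attempt lands on a bad disk.

    Assumes (hypothetically) that each attempt chooses a disk uniformly
    at random and independently of the previous attempts.
    """
    return (bad_disks / num_disks) ** max_attempts


# One bad disk out of 12, as in the cluster described above.
single = all_attempts_hit_bad_disk(num_disks=12, bad_disks=1, max_attempts=1)
retried = all_attempts_hit_bad_disk(num_disks=12, bad_disks=1, max_attempts=10)

print(f"single attempt: {single:.4f}")
print(f"10 attempts:    {retried:.2e}")
```

Under these assumptions, a single attempt hits the bad disk about 1 time in 12, while all 10 attempts hitting it has probability 1/12^10. This only covers the temporary file; the final destination file is not retried in this sketch.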
