yaooqinn commented on issue #25962: [SPARK-29285][Shuffle] Temporary shuffle files should be able to handle disk failures
URL: https://github.com/apache/spark/pull/25962#issuecomment-547715872

> So the only time Hadoop should show you this bad disk is if yarn doesn't detect it or if it goes bad during the running of the container. YARN has a specific feature to detect bad disks and will not give that to the container if they are bad. So in your case are your executors very long running? Are you using the yarn feature?

The YARN disk health check is on. Yes, it helps a lot for our roughly 40k-50k daily Spark jobs as a whole. But it does not help much for jobs with long-lived executors (30 minutes or longer), which are typically big ETL jobs.

> What happens when we go to rename/merge the temp file to final location? the shuffle file name is static so should hash to same dir every time unless we are adding different dir.

With multiple disks, the temp file and the final file are likely to pick different disks, because the temp file name carries a random component while the final shuffle file name is static (see the sketch at the end of this comment). If the final file picks the bad disk, the task attempt is doomed to fail anyway. But if only the temp file picks the bad disk, the task is still worth saving.

> Yes it depends on how often your executors are created/destroyed, if using dynamic allocation and a lot of long tail it could be cycling those fairly often and yarn disk checker should help, if not it won't. Lots of jobs it won't help by itself.

Most of our Spark jobs run with dynamic allocation on. But executor recycling is not granular enough to handle disk failures, since YARN's disk check is only a periodic task.

> Is this ok, maybe, but it's potentially changing from failing fast to failing later. if there is a long time between those then you're potentially taking longer.

This can happen if and only if the temp and the final files pick the same disk. And compared to rescheduling an entire task, how long can it take for a rename/move of a single file to succeed or fail?

> Has this actually been run on real jobs and have you seen a benefit?

We have applied this to that cluster for more than 3 months. I have not collected precise statistics for this particular exception, but before the change, users came to us for help with this kind of failure once every 2 or 3 days on average. Since then, there have been none.
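To illustrate the point about temp vs. final files landing on different disks, here is a minimal Scala sketch of hash-based directory selection. It is not Spark's actual `DiskBlockManager` code; the `localDirs` list, the `hashToDir` helper, and the concrete file names are illustrative assumptions. It only shows why a temp shuffle file, whose name includes a random UUID, usually hashes to a different local dir (and hence possibly a different disk) than the final shuffle file, whose name is static.

```scala
// Sketch: pick one of several local dirs (disks) by hashing the file name.
// The dir list and file names below are hypothetical, for illustration only.
object DirSelectionSketch {
  // Suppose the executor is configured with three local dirs on different disks.
  val localDirs: Array[String] = Array("/data1/spark", "/data2/spark", "/data3/spark")

  // Deterministically map a file name to a local dir via a non-negative hash.
  def hashToDir(fileName: String): String = {
    val idx = (fileName.hashCode & Int.MaxValue) % localDirs.length
    localDirs(idx)
  }

  def main(args: Array[String]): Unit = {
    // The final shuffle data file name is static, so it always hashes to the same dir.
    val finalFile = "shuffle_0_12_0.data"
    // The temp shuffle file name contains a random UUID, so it usually hashes elsewhere.
    val tempFile = s"temp_shuffle_${java.util.UUID.randomUUID()}"

    println(s"final -> ${hashToDir(finalFile)}")
    println(s"temp  -> ${hashToDir(tempFile)}")
    // If only the temp file's dir sits on a bad disk, retrying the temp file on a
    // healthy dir can still save the task; if the final file's dir is the bad one,
    // the task attempt fails regardless.
  }
}
```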
