yaooqinn commented on issue #25962: [SPARK-29285][Shuffle] Temporary shuffle files should be able to handle disk failures
URL: https://github.com/apache/spark/pull/25962#issuecomment-547715872
 
 
   > So the only time Hadoop should show you this bad disk is if YARN doesn't
detect it, or if it goes bad during the running of the container. YARN has a
specific feature to detect bad disks and will not give them to the container if
they are bad. So in your case, are your executors very long running? Are you
using the YARN feature?
   
   The YARN disk health check is on. Yes, it helps a lot across our roughly
40k-50k daily Spark jobs as a whole. But it does not help much for jobs with
long-lived executors (30 minutes or longer), which tend to be big ETL jobs.
   
   > What happens when we go to rename/merge the temp file to the final
location? The shuffle file name is static, so it should hash to the same dir
every time unless we are adding a different dir.
   
   With multiple disks, the temp file and the final file are likely to pick
different disks, because the temp file gets a fresh random name per attempt
while the final file name is static (see the sketch below). If the final file
picks the bad disk, the task attempt is doomed to fail anyway. But if only the
temp file picks it, the attempt is still worth saving.
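   To make the dir picking concrete, here is a minimal, self-contained Scala
sketch of the idea. The dir paths and the exact hashing are illustrative, not
Spark's actual `DiskBlockManager` internals: the point is only that a randomly
named temp file and a statically named final file usually hash to different
local dirs.

   ```scala
   import java.util.UUID

   object DiskPickSketch {
     // Hypothetical stand-ins for spark.local.dir entries, one per disk.
     val localDirs = Vector("/data1/spark", "/data2/spark", "/data3/spark")

     // Map a file name to a local dir by non-negative hash, mirroring the
     // idea (not the exact code) of Spark's dir selection.
     def dirFor(filename: String): String = {
       val mod = filename.hashCode % localDirs.length
       localDirs(if (mod < 0) mod + localDirs.length else mod)
     }

     def main(args: Array[String]): Unit = {
       val temp = s"temp_shuffle_${UUID.randomUUID()}" // random name per attempt
       val finalName = "shuffle_0_42_0.data"           // static per map output
       println(s"temp  -> ${dirFor(temp)}")      // varies from run to run
       println(s"final -> ${dirFor(finalName)}") // always the same dir
     }
   }
   ```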
   
   > Yes, it depends on how often your executors are created/destroyed. If you
are using dynamic allocation with a lot of long tail, it could be cycling those
fairly often and the YARN disk checker should help; if not, it won't. For lots
of jobs it won't help by itself.
   
   Most of our Spark jobs have dynamic allocation enabled. But executor
recycling is not granular enough to handle disk failures, since YARN's disk
check is only a periodic task (it runs every two minutes by default).
   
   
   > Is this ok? Maybe, but it's potentially changing from failing fast to
failing later. If there is a long time between those, then you're potentially
taking longer.
   
   This can happen if and only if the temp file and the final file pick the
same disk. But compared to rescheduling an entire task, how long can it take
for a file rename/move to succeed or fail?
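   For reference, a hedged sketch of the saving-the-attempt idea, assuming
hypothetical local dirs (this is not the exact code in this PR): when creating
a temp shuffle file fails on one disk, try the next local dir instead of
failing the task attempt.

   ```scala
   import java.io.{File, IOException}
   import java.util.UUID

   object TempShuffleRetrySketch {
     // Hypothetical local dirs, one per disk.
     val localDirs = Vector(
       new File("/data1/spark"), new File("/data2/spark"), new File("/data3/spark"))

     // Try each local dir in turn; an IOException from a bad disk just moves
     // us on to the next dir instead of failing the whole task attempt.
     def createTempShuffleFile(): File =
       localDirs.iterator
         .map { dir =>
           try {
             val f = new File(dir, s"temp_shuffle_${UUID.randomUUID()}")
             if (f.createNewFile()) Some(f) else None
           } catch {
             case _: IOException => None // bad disk: skip it
           }
         }
         .collectFirst { case Some(f) => f }
         .getOrElse(throw new IOException(
           s"Could not create a temp shuffle file in any of ${localDirs.size} local dirs"))
   }
   ```

   Only if every local dir fails does the attempt fail, which is the
fail-later trade-off discussed above.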
   
   > Has this actually been run on real jobs and have you seen a benefit?
   
   We have been running this on that cluster for more than 3 months. I have
not collected precise statistics for this particular exception, but before the
change, users came to us for help with this kind of failure once every 2 or 3
days on average. Since then, there have been none.
   
