tgravescs commented on issue #25962: [SPARK-29285][Shuffle] Temporary shuffle 
files should be able to handle disk failures
URL: https://github.com/apache/spark/pull/25962#issuecomment-547425618
 
 
   > In our 2,000-node Hadoop cluster, with 12 disks per node, this approach reduces the number of those exceptions a lot.
   
   So the only time Hadoop should hand you a bad disk is if YARN doesn't detect it, or if it goes bad while the container is running.  YARN has a specific feature to detect bad disks and will not give them to a container.  So in your case, are your executors very long running?  Are you using that YARN feature?
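   For reference, here's a minimal sketch of what I mean by the YARN side (illustrative only, not the exact Spark code): the NodeManager's disk health checker drops bad dirs before the container launches and passes the surviving ones to the container through the `LOCAL_DIRS` env var, so a long-running executor keeps the launch-time list even if a disk fails afterwards.

   ```scala
   // Sketch only: where the YARN-filtered local dirs come from.
   // Assumes YARN sets LOCAL_DIRS at container launch (standard NodeManager behavior);
   // object/method names here are illustrative, not Spark internals.
   object LocalDirsSketch {
     def yarnLocalDirs(): Seq[String] =
       sys.env.get("LOCAL_DIRS")                        // set once, at launch time
         .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSeq)
         .getOrElse(Seq.empty)

     def main(args: Array[String]): Unit =
       println(s"healthy-at-launch local dirs: ${yarnLocalDirs().mkString(", ")}")
   }
   ```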
   I'm not necessarily against this idea, since disks can go bad while executors are running, but I want to check how much this is really happening.  What happens when we go to rename/merge the temp file to its final location?  The shuffle file name is static, so it should hash to the same dir every time unless we add a different dir; I can't remember that code off the top of my head.  With the external shuffle service, the application registers which directories it is using so that the external shuffle service can find the files again.  I'm wondering if the temp files might work but then fail later on the static names.
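   For the hashing part, I mean roughly this (a hedged sketch of the placement logic as I remember it; `localDirs` and `subDirsPerLocalDir` are illustrative names, not the exact fields): because the final shuffle file name is static, it always maps to the same local dir, so redirecting only the temp-file write wouldn't help once we rename to the final location on a bad disk.

   ```scala
   import java.io.File

   // Hedged sketch of hash-based file placement: a static filename always
   // resolves to the same local dir and sub dir.
   object ShuffleFilePlacementSketch {
     // Non-negative hash of the filename (guards against Int.MinValue).
     def nonNegativeHash(s: String): Int = {
       val h = s.hashCode
       if (h != Int.MinValue) math.abs(h) else 0
     }

     def fileFor(localDirs: Array[File], subDirsPerLocalDir: Int, filename: String): File = {
       val hash = nonNegativeHash(filename)
       val dirId = hash % localDirs.length                           // which local dir (disk)
       val subDirId = (hash / localDirs.length) % subDirsPerLocalDir // which sub dir inside it
       new File(new File(localDirs(dirId), f"$subDirId%02x"), filename)
     }

     def main(args: Array[String]): Unit = {
       val dirs = Array(new File("/data1/spark"), new File("/data2/spark"))
       // The static name lands in the same place on every call:
       println(fileFor(dirs, 64, "shuffle_0_1_0.data"))
     }
   }
   ```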
