vanzin commented on issue #23647: [SPARK-26712]Support multi directories for executor shuffle info recovery in yarn shuffle serivce
URL: https://github.com/apache/spark/pull/23647#issuecomment-458271351

I'm very torn on this. It makes sense to try to use a better disk, but the NM itself doesn't do that, so if the recovery dir is bad, the NM will be affected regardless of this change. It feels to me like enabling the option in SPARK-16505 is the right thing: if your recovery dir is bad, the NM shouldn't be running until that is fixed. But that also assumes the failure is detected during shuffle service initialization, and not later.

If implementing multi-disk support, I'm also not sure how you'd do it. Opening the DB may or may not work, depending on how bad the disk is. So if opening fails the first time and you write the recovery DB to some other directory, but then the NM crashes (e.g. because of the bad disk) and on the next start opening the original DB works on the first try, you'll end up reading stale data before you realize you're reading from the bad disk. I see you have checks for the last mod time, but even that can cause trouble in a scenario where the failure may or may not happen depending on when you look...

I tend to think that if your recovery disk is bad, that should be treated as a catastrophic failure, and trying to work around it is kind of pointless. What you could do is keep running in spite of the bad disk, e.g. by keeping state only in memory. You'd only see problems when the NM is restarted (you'd lose existing state), but at that point Spark's retry mechanism should fix things.
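A minimal sketch of that last idea, assuming a LevelDB-backed store like the one the shuffle service keeps in the recovery dir; `BestEffortRecoveryStore` and its methods are hypothetical illustrations, not the actual `YarnShuffleService` API:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

// Hypothetical sketch: a recovery store that degrades to in-memory state
// when the on-disk DB cannot be opened, instead of failing hard or probing
// other directories (and risking the stale-data race described above).
class BestEffortRecoveryStore {
  private DB db;  // null when the recovery disk is unusable
  private final Map<String, byte[]> mem = new ConcurrentHashMap<>();

  BestEffortRecoveryStore(File recoveryDir) {
    try {
      Options options = new Options();
      options.createIfMissing(true);
      db = JniDBFactory.factory.open(
          new File(recoveryDir, "registeredExecutors.ldb"), options);
    } catch (IOException e) {
      // Bad recovery disk: keep serving from memory. Executor state is
      // lost on the next NM restart, and Spark's retry mechanism has to
      // recover from that, but the shuffle service itself stays up.
      db = null;
    }
  }

  void put(String key, byte[] value) {
    if (db != null) {
      db.put(key.getBytes(StandardCharsets.UTF_8), value);
    } else {
      mem.put(key, value);
    }
  }

  byte[] get(String key) {
    if (db != null) {
      return db.get(key.getBytes(StandardCharsets.UTF_8));
    }
    return mem.get(key);
  }
}
```

The point of the design is that there is exactly one on-disk location ever read from, so a flaky disk can never hand back a stale-but-openable DB from a second directory; the only degraded mode is losing state across an NM restart.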
