hiboyang commented on pull request #31715: URL: https://github.com/apache/spark/pull/31715#issuecomment-790199017
> Users may need to set up this application config differently across different solutions, e.g. external shuffle service, built-in shuffle service, remote shuffle service, mixed solution, etc. This is somehow low-level Spark behavior, and I'm suspicious it is good to expose it to end users and let them decide the config. It sounds easy to set an improper value.
>
> Can Spark decide to unregister shuffle output automatically? Like based on which shuffle manager is used for shuffle output? Or like @attilapiros's idea, to have a property somewhere close to MapStatus?

If Spark decided whether to unregister shuffle output based on which shuffle manager is used, Spark would need knowledge of the different shuffle manager implementations. That is hard to implement, because a user can set any shuffle manager implementation through spark.shuffle.manager.

As for "Users may need to set up this application config differently across different solutions": yes, that is the purpose. There are many shuffle solutions, as you listed. The current Spark design is pretty good: it lets the user set spark.shuffle.manager to a custom class to choose a different solution. However, it assumes that shuffle files are always lost when an executor is lost. This assumption conflicts with the customizable shuffle manager design. The new config spark.shuffle.markFileLostOnExecutorLost keeps that assumption by default, but gives the user the option to choose different behavior when needed.
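
To illustrate the intended usage, here is a sketch of the properties a user might set when shuffle files are stored outside the executor. The class name `com.example.shuffle.RemoteShuffleManager` is hypothetical, and `spark.shuffle.markFileLostOnExecutorLost` is the config proposed in this PR, so it may not exist in any released Spark version. The value `false` assumes the default is `true`, i.e. the current behavior.

```
# Hypothetical custom shuffle manager that keeps shuffle files off the executor
spark.shuffle.manager                      com.example.shuffle.RemoteShuffleManager
# Config proposed in this PR: do not treat shuffle files as lost when an executor is lost
spark.shuffle.markFileLostOnExecutorLost   false
```

With the built-in shuffle manager and no external storage, the second property would be left at its default so that shuffle output is still unregistered when an executor is lost.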
