hiboyang opened a new pull request #34864: URL: https://github.com/apache/spark/pull/34864
### What changes were proposed in this pull request?

This PR adds support for storing shuffle files on external shuffle storage such as S3. It helps Dynamic Allocation on Kubernetes: the Spark driver can release idle executors without worrying about losing shuffle data, because the shuffle data is stored on external shuffle storage that is separate from the executors.

This can be viewed as follow-up work for https://issues.apache.org/jira/browse/SPARK-25299.

### Why are the changes needed?

To better support Dynamic Allocation on Kubernetes, we need to decouple shuffle data from Spark executors. This PR implements another shuffle manager that supports writing shuffle data to S3.

### Does this PR introduce _any_ user-facing change?

Yes. This PR adds two Spark configs, shown below, to plug in the new StarShuffleManager and store shuffle data at the provided S3 location.

```
spark.shuffle.manager=org.apache.spark.shuffle.StarShuffleManager
spark.shuffle.star.rootDir=s3://my_bucket_name/my_shuffle_folder
```

### How was this patch tested?

Added a unit test for StarShuffleManager. Many classes are copied from Spark, so no tests were added for those classes. We will work with the community to get feedback first, then work on removing the copied/duplicated code.
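
For illustration, the two settings from the user-facing change section could also be passed to `spark-submit` as `--conf` flags. The sketch below only builds the command line and does not run Spark. The helper name `build_submit_command`, the application file `my_app.py`, and the bucket path are placeholders, and it assumes a Spark build that includes this PR's `StarShuffleManager`.

```python
import shlex

# The two settings introduced by this PR; the S3 path is a placeholder.
STAR_SHUFFLE_CONF = {
    "spark.shuffle.manager": "org.apache.spark.shuffle.StarShuffleManager",
    "spark.shuffle.star.rootDir": "s3://my_bucket_name/my_shuffle_folder",
}


def build_submit_command(app, conf):
    """Return a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # hypothetical application file, passed last
    return args


if __name__ == "__main__":
    # Print a shell-safe command line for inspection; nothing is executed.
    print(shlex.join(build_submit_command("my_app.py", STAR_SHUFFLE_CONF)))
```

Keeping the settings in a dictionary makes it easy to point `spark.shuffle.star.rootDir` at a different location per environment without touching the rest of the command.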
