hiboyang opened a new pull request #34864:
URL: https://github.com/apache/spark/pull/34864


   ### What changes were proposed in this pull request?
   This PR adds support for storing shuffle files on external shuffle storage such as S3, which
   helps Dynamic Allocation on Kubernetes. The Spark driver can release idle executors without
   worrying about losing shuffle data, because the shuffle data is stored on external shuffle
   storage that is separate from the executors.
   
   This can be viewed as follow-up work for https://issues.apache.org/jira/browse/SPARK-25299.
   
   ### Why are the changes needed?
   
   To better support Dynamic Allocation on Kubernetes, we need to decouple shuffle data from
   Spark executors. This PR implements another shuffle manager that supports writing shuffle
   data to S3.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Users can plug in the new StarShuffleManager and store shuffle data at a given S3
   location by setting the following two Spark configs:
   ```
   spark.shuffle.manager=org.apache.spark.shuffle.StarShuffleManager
   spark.shuffle.star.rootDir=s3://my_bucket_name/my_shuffle_folder
   ```
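
As a rough illustration of how these two settings might be passed to a job, the sketch below builds a `spark-submit` argument list with one `--conf` flag per setting. The helper `build_submit_command`, the jar name, and the main class are hypothetical names used only for this example; `StarShuffleManager` is the class proposed in this PR and is not part of a released Spark version.

```python
import shlex

# Settings quoted from the PR description. StarShuffleManager is the class this
# PR proposes; it is not available in a released Spark version.
STAR_SHUFFLE_CONF = {
    "spark.shuffle.manager": "org.apache.spark.shuffle.StarShuffleManager",
    "spark.shuffle.star.rootDir": "s3://my_bucket_name/my_shuffle_folder",
}


def build_submit_command(app_jar, main_class, conf):
    """Return a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit", "--class", main_class]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


if __name__ == "__main__":
    # The jar path and main class are placeholders for illustration only.
    command = build_submit_command("my-app.jar", "com.example.MyApp", STAR_SHUFFLE_CONF)
    print(shlex.join(command))
```

Dynamic Allocation itself would presumably be enabled separately (for example with `spark.dynamicAllocation.enabled=true`); the PR description does not list the exact settings it expects.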
   
   ### How was this patch tested?
   
   Added a unit test for StarShuffleManager. Many classes are copied from Spark, so no tests
   were added for those classes. We will work with the community to get feedback first, then
   work on removing the code duplication.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


