hiboyang opened a new pull request #34864:
URL: https://github.com/apache/spark/pull/34864


   ### What changes were proposed in this pull request?
   This PR ([design doc](https://docs.google.com/document/d/10rhvjXUlbQfWg-zh02_aqRqDT_ZnwYmICPAR--aRv64))
   adds support for storing shuffle files on external shuffle storage such as S3.
   This helps Dynamic Allocation on Kubernetes: the Spark driver can release idle
   executors without worrying about losing shuffle data, because the shuffle data
   is stored on external shuffle storage that is separate from the executors.
   
   This can be viewed as follow-up work to
   https://issues.apache.org/jira/browse/SPARK-25299.
   
   The existing Worker Decommission feature
   ([SPARK-33545](https://issues.apache.org/jira/browse/SPARK-33545)) can already
   copy shuffle data to fallback storage such as S3. That work is appreciated
   because it addresses the critical issue of handling shuffle data when a Spark
   executor terminates. This PR does not intend to replace that feature. The
   intent is to open further discussion about how to save shuffle data on S3
   during normal execution.
   
   ### Why are the changes needed?
   
   To better support Dynamic Allocation on Kubernetes, we need to decouple
   shuffle data from the Spark executor. This PR implements another Shuffle
   Manager that supports writing shuffle data to S3.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. This PR adds two Spark configs, shown below, which plug in the new
   StarShuffleManager and store shuffle data at the provided S3 location.
   ```
   spark.shuffle.manager=org.apache.spark.shuffle.StarShuffleManager
   spark.shuffle.star.rootDir=s3://my_bucket_name/my_shuffle_folder
   ```
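
   For illustration only, the same two settings could be passed on the command
   line with `spark-submit --conf`. This is a sketch under assumptions: the main
   class `com.example.MyApp` and the jar `my-app.jar` are hypothetical
   placeholders, and it assumes a Spark build containing this PR and working S3
   access from the driver and executors.

   ```shell
   # Hypothetical application class and jar; only the two --conf values come from this PR.
   spark-submit \
     --conf spark.shuffle.manager=org.apache.spark.shuffle.StarShuffleManager \
     --conf spark.shuffle.star.rootDir=s3://my_bucket_name/my_shuffle_folder \
     --class com.example.MyApp \
     my-app.jar
   ```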
   
   ### How was this patch tested?
   
   Added a unit test for StarShuffleManager. Many classes are copied from
   Spark, so no tests were added for those classes. We will work with the
   community to get feedback first, then work on removing the code
   copy/duplication.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


