steveloughran commented on pull request #34864:
URL: https://github.com/apache/spark/pull/34864#issuecomment-1005770261


   
   Obviously I am biased, but I believe that rather than using the AWS 
APIs yourself, you should use the Hadoop FileSystem APIs and interact 
with S3 through the s3a connector.
   
   For a high-performance upload of a local file, use 
`FileSystem.copyFromLocalFile`. In s3a on Hadoop 3.3.2 this uses the same 
transfer manager class as this PR does, but adds exception handling/mapping, 
encryption settings, and auditing. And the s3a integration tests verify all of 
this works... By the time you use it here you can assume the S3 upload works, 
and it becomes a matter of wiring it up to Spark.
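   To make the suggestion concrete, here is a hedged sketch of what the upload path could look like when delegated to the FileSystem API. The class and method names (`ShuffleUpload`, `uploadShuffleFile`) are illustrative, not from the PR; the point is that `FileSystem.newInstance` resolves the right connector from the URI scheme, so the same code works for `s3a://`, `abfs://`, `gs://`, or `file://` destinations.

```java
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShuffleUpload {

    // Hypothetical helper: let the FileSystem implementation chosen by
    // the destination URI scheme handle the transfer, including any
    // store-specific high-performance upload path, retries, and
    // encryption settings.
    public static void uploadShuffleFile(Configuration conf,
                                         File local,
                                         String destUri) throws IOException {
        Path dest = new Path(destUri);
        // newInstance() binds to the connector for the URI scheme
        // (s3a, abfs, gs, file, ...).
        try (FileSystem fs = FileSystem.newInstance(dest.toUri(), conf)) {
            // delSrc=false: keep the local file; overwrite=true.
            fs.copyFromLocalFile(false, true, new Path(local.toURI()), dest);
        }
    }
}
```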
   
   
   As `copyFromLocalFile` is implemented for all filesystems, the 
component will also work with other stores, including Google Cloud Storage and 
Azure ABFS, even if they do not override the base method with a 
high-performance implementation -yet.
   
   This also means that you could write tests for the feature using file:// as 
the destination store and include them in the Spark module; if you design those 
tests to be overridable so they work with other filesystems, they could be 
picked up and reused as the actual integration test suites in an external 
module.
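   A minimal sketch of what such an overridable suite could look like. The class name and the `getTestRootUri()` hook are hypothetical; the idea is that the base class defaults to the local filesystem, and a subclass in an external module overrides one method to point the same assertions at a real object store.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical base suite: runs against file:// by default; a
// downstream integration-test module subclasses it and overrides
// getTestRootUri() to return e.g. an s3a:// or abfs:// path.
public abstract class AbstractShuffleStoreSuite {

    // Override in a subclass to target a real object store.
    protected String getTestRootUri() {
        return "file:///tmp/shuffle-store-test";
    }

    public void testRoundTrip() throws Exception {
        Configuration conf = new Configuration();
        Path root = new Path(getTestRootUri());
        FileSystem fs = FileSystem.get(root.toUri(), conf);
        Path file = new Path(root, "block-0.data");
        byte[] data = "shuffle block".getBytes();

        // Write through the FileSystem API, not store-specific SDKs.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write(data);
        }

        // Read it back and verify the contents survived the round trip.
        byte[] read = new byte[data.length];
        try (FSDataInputStream in = fs.open(file)) {
            in.readFully(read);
        }
        if (!Arrays.equals(data, read)) {
            throw new AssertionError("round trip failed for " + file);
        }
        fs.delete(root, true);
    }
}
```

   Because the suite only goes through FileSystem calls, the same test body exercises the local FS in CI and a real store in the external module.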
   
   And, because someone else owns the problem of the S3 connector binding, you 
get to avoid fielding support calls about configuring AWS endpoints and 
regions, supporting third-party S3 stores, qualifying AWS SDK updates, etc.
   
   Accordingly, I would propose 
   
   * rip out `StarS3ShuffleFileManager` and use the FileSystem APIs
   * write tests which can be pointed at the local FS, but then targeted at 
real object stores in a downstream test module, as 
https://github.com/hortonworks-spark/cloud-integration does for committer 
testing.
   
   Getting integration tests set up is inevitably going to be somewhat 
complicated. I can provide a bit of consultation there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


