steveloughran commented on pull request #34864: URL: https://github.com/apache/spark/pull/34864#issuecomment-1005770261
Obviously I am biased, but I believe that rather than trying to use the AWS APIs yourself, you should just use the Hadoop FileSystem APIs and interact with S3 through the s3a connector. For a high-performance upload of a local file, use `FileSystem.copyFromLocalFile`; in s3a on Hadoop 3.3.2 this uses the same transfer manager class as this PR does, but adds exception handling/mapping, encryption settings, and auditing. The s3a integration tests verify that all of this works, so by the time you use it here you can assume the S3 upload works, and it becomes a matter of wiring it up to Spark.

Because `copyFromLocalFile` is implemented for all filesystems, the component will also work with other stores, including Google Cloud Storage and Azure abfs, even if they do not override the base method with a high-performance implementation yet. It also means you could write tests for the feature using `file://` as the destination store and include them in the Spark module; if you design such tests to be overridable to work with other filesystems, they could be picked up and reused as the actual integration test suites in an external module. And, because someone else owns the problem of the S3 connector binding, you avoid fielding support calls about configuring the AWS endpoint and region, support for third-party S3 stores, qualifying AWS SDK updates, etc.

Accordingly, I would propose:

* ripping out `StarS3ShuffleFileManager` and using the FileSystem APIs
* writing tests which can be pointed at the local FS, but then targeted at real object stores in a downstream test module, as https://github.com/hortonworks-spark/cloud-integration does for committer testing

Getting integration tests set up is inevitably going to be somewhat complicated; I can provide a bit of consultation there.
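The overridable-test idea above can be sketched roughly as follows. This is a minimal, hypothetical illustration (the class and method names are my own, not from the PR), and `java.nio.file.Files.copy` stands in for Hadoop's `FileSystem.copyFromLocalFile` so the sketch runs without hadoop-common on the classpath; a cloud-specific subclass in an external module would override the destination root to point at `s3a://`, `abfs://`, or `gs://` instead of the local temp directory.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Base suite: targets file:// by default; a subclass overrides targetRoot()
// to point the same round-trip check at a real object store.
public class CopyToStoreSuite {

    /** Destination root for uploads; override in a store-specific subclass. */
    protected URI targetRoot() throws IOException {
        // Local-filesystem default, so the suite runs in the core module.
        return Files.createTempDirectory("copy-suite").toUri();
    }

    /** The check every store must pass: upload a local file, read it back. */
    public boolean uploadRoundTrips() throws IOException {
        Path src = Files.createTempFile("shuffle-block", ".data");
        Files.write(src, "block-bytes".getBytes());

        Path dest = Paths.get(targetRoot()).resolve(src.getFileName().toString());
        // Stand-in for FileSystem.copyFromLocalFile(src, dest) in a real suite.
        Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);

        return new String(Files.readAllBytes(dest)).equals("block-bytes");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(new CopyToStoreSuite().uploadRoundTrips());
    }
}
```

In a real Spark module the base suite would obtain the destination filesystem via Hadoop's `FileSystem.get(uri, conf)` and call `copyFromLocalFile`, so only the URI and configuration would change per store while the assertions stay shared.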
