hiboyang commented on pull request #34864:
URL: https://github.com/apache/spark/pull/34864#issuecomment-1006171185


   > Obviously I am biased, but I believe that rather than trying to use the
   > AWS APIs yourself, you should just use the Hadoop FileSystem APIs and
   > interact with S3 through the s3a connector.
   > 
   > For a high-performance upload of a local file, use
   > `FileSystem.copyFromLocalFile`. In s3a on Hadoop 3.3.2 this uses the same
   > transfer manager class as this PR does, but adds exception
   > handling/mapping, encryption settings, and auditing. And the s3a
   > integration tests verify all of this works... By the time you use it here
   > you can assume the S3 upload works, and it becomes a matter of wiring it
   > up to Spark.
   > 
   > As `copyFromLocalFile` is implemented for all filesystems, the component
   > will also work with other stores, including Google Cloud Storage and
   > Azure abfs, even if they do not yet override the base method with a
   > high-performance implementation.
   > 
   > This also means that you could write tests for the feature using
   > `file://` as the destination store and include them in the Spark module;
   > if you design such tests to be overrideable to work with other
   > filesystems, they could be picked up and reused as the actual integration
   > test suites in an external module.
   > 
   > And, because someone else owns the problem of the S3 connector binding,
   > you avoid fielding support calls about configuring AWS endpoints and
   > regions, supporting third-party S3 stores, qualifying AWS SDK updates,
   > etc.
   > 
   > Accordingly, I would propose:
   > 
   > * ripping out `StarS3ShuffleFileManager` and using the FileSystem APIs
   > * writing tests which can be pointed at the local FS, but then targeted
   >   at real object stores in a downstream test module, as
   >   https://github.com/hortonworks-spark/cloud-integration does for
   >   committer testing.
   > 
   > Getting integration tests set up is inevitably going to be somewhat 
complicated. I can provide a bit of consultation there.
   
   Yes, these are great suggestions! Thanks again! I will find time to make
   these changes, and may also reach out to you for consultation when adding
   the integration tests :)
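   For concreteness, here is a minimal sketch (not from the PR) of what the
   suggested `FileSystem.copyFromLocalFile` approach could look like. The
   class name and command-line arguments are illustrative, and it assumes
   hadoop-common (plus hadoop-aws for the `s3a://` scheme) is on the
   classpath:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: upload a local file through the Hadoop FileSystem
// API instead of calling the AWS SDK directly.
public class CopyToStore {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];   // e.g. /tmp/shuffle-data.bin
        String destUri  = args[1];   // e.g. s3a://bucket/path, or file:///tmp/out

        Configuration conf = new Configuration();
        // The FileSystem is resolved from the URI scheme, so the same code
        // works against s3a://, abfs://, gs://, or file:// for local tests,
        // as the comment suggests.
        FileSystem fs = FileSystem.get(URI.create(destUri), conf);

        // delSrc=false keeps the local file; overwrite=true replaces any
        // existing destination. On Hadoop 3.3.2+ the s3a implementation
        // routes this through the SDK transfer manager for multipart upload.
        fs.copyFromLocalFile(false, true, new Path(localSrc), new Path(destUri));
    }
}
```

   A test suite built this way could default `destUri` to a `file://` temp
   directory and let a downstream module override it with a real object
   store URI.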
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


