jackye1995 commented on pull request #1573: URL: https://github.com/apache/iceberg/pull/1573#issuecomment-719050513
> > @danielcweeks, I did a fairly thorough review and found some problems in the read path (not updating positions, seek falling back to a new connection). Those also expose some test gaps.
> >
> > Otherwise, there are two things I'd fix:
> >
> > * It would be really nice to depend on only one version of the AWS SDK. If the URI class is only used to parse path and bucket, can we replace it with something? @jackye1995, any ideas?
> > * Implement `create`. We can do this in a follow-up, but it seems like it will bite us if we don't and forget
>
> I expanded the read path testing to cover a number of other cases.
>
> I agree it's not ideal to depend on both v1 and v2 of the SDK, but they are intended to coexist, and from what I can see there isn't a current alternative to the `AmazonS3URI` class. I'd rather not create a utility just to mimic that behavior.
>
> I actually disagree with implementing `create()` because, as I mentioned earlier, just about any scenario depending on "create if not exists" behavior is risky with S3. Considering that the object doesn't even appear until a stream is closed, which can take any amount of time, it leads to any number of potential race conditions. Seeing as `create()` is not used anywhere currently, we should leave it unimplemented until the need arises and revisit this decision at that time.

Sorry for the late reply, somehow I missed the notification...

I personally think it makes more sense to implement our own version of `AmazonS3URI`, as Ryan suggested. Here are my reasons:

1. As mentioned in the AWS SDK GitHub issue you cited, the use of S3 URIs is chaotic. There are multiple different versions of the URI string, and the old `AmazonS3URI` only supports a very specific subset, so it is understandable that the new AWS SDK does not provide a standard `AmazonS3URI` class. If you look at the code of that class, it just parses a string using the regex pattern `^(.+\\.)?s3[.-]([a-z0-9-]+)\\.`, so it is not hard to write a custom URI parser on the Iceberg side to do the work (see the sketch below).
2. Some customers are now using `s3a://` URIs in order to use `HadoopFileIO`, so we might need to accept this form as well to make migration to `S3FileIO` easy. Currently `AmazonS3URI` throws an exception for this case; implementing our own URI parser would be an easy way to support it.
3. After all, `S3FileIO` will only write to S3. If a user chooses this FileIO, it might even make sense to accept a plain file path instead of a URI, for example `/my-bucket/my/file/path`, to write to an S3 location. `HadoopFileIO` already has this behavior: a plain path is written to the default file system specified in the Hadoop config.
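To make the idea concrete, here is a minimal sketch of what such a parser could look like. The class name `S3URI`, its `bucket()`/`key()` accessors, and the set of accepted schemes are assumptions for illustration, not an existing Iceberg API; the class simply splits an `s3://`, `s3a://`, or `s3n://` URI into a bucket and a key, which is all `S3FileIO` needs:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class S3URI {
  // Schemes accepted by this sketch; an assumption, adjust as needed.
  private static final Set<String> VALID_SCHEMES =
      new HashSet<>(Arrays.asList("s3", "s3a", "s3n"));

  private final String location;
  private final String bucket;
  private final String key;

  public S3URI(String location) {
    this.location = location;

    // Split "s3://my-bucket/my/file/path" into scheme and remainder.
    String[] schemeSplit = location.split("://", 2);
    if (schemeSplit.length != 2 || !VALID_SCHEMES.contains(schemeSplit[0].toLowerCase())) {
      throw new IllegalArgumentException("Invalid S3 URI: " + location);
    }

    // Split "my-bucket/my/file/path" into bucket and key.
    String[] authoritySplit = schemeSplit[1].split("/", 2);
    if (authoritySplit.length != 2 || authoritySplit[1].isEmpty()) {
      throw new IllegalArgumentException("Invalid S3 URI, missing key: " + location);
    }

    this.bucket = authoritySplit[0];
    this.key = authoritySplit[1];
  }

  public String bucket() {
    return bucket;
  }

  public String key() {
    return key;
  }

  public String location() {
    return location;
  }
}
```

Supporting bare paths like `/my-bucket/my/file/path` (point 3 above) would mostly be a matter of relaxing the scheme check in this sketch.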
