jackye1995 commented on pull request #1573: URL: https://github.com/apache/iceberg/pull/1573#issuecomment-719050513
> > @danielcweeks, I did a fairly thorough review and found some problems in the read path (not updating positions, seek falling back to a new connection). Those also expose some test gaps.
> >
> > Otherwise, there are two things I'd fix:
> >
> > * It would be really nice to depend on only one version of the AWS SDK. If the URI class is only used to parse path and bucket, can we replace it with something? @jackye1995, any ideas?
> > * Implement `create`. We can do this in a follow-up, but it seems like it will bite us if we don't and forget
>
> I expanded the read path testing to cover a number of other cases.
>
> I agree it's not ideal to depend on both v1 and v2 of the SDK, but they are intended to coexist, and from what I can see there isn't a current alternative to the `AmazonS3URI` class. I'd rather not create a utility just to mimic that behavior.
>
> I actually disagree with implementing `create()` because, as I mentioned earlier, just about any scenario depending on "create if not exists" behavior is risky with S3. Considering that the object doesn't even appear until a stream is closed, which can take any amount of time, it leads to any number of potential race conditions. Seeing as `create()` is not used anywhere currently, we should leave it unimplemented until the need arises and revisit this decision at that time.

Sorry for the late reply, somehow I missed the notification...

I personally think it makes more sense to implement our own version of `AmazonS3URI`, as Ryan suggested. Here are my reasons:

1. As mentioned in the AWS SDK GitHub issue you cited, the use of S3 URIs is chaotic. There are multiple different versions of the URI string, and the old `AmazonS3URI` only supports a very specific subset, so it is understandable that the new AWS SDK does not provide a standard `AmazonS3URI` class. If you look at the code of that class, it just parses a string using the regex pattern `^(.+\\.)?s3[.-]([a-z0-9-]+)\\.`, so it is not hard to write a custom URI parser on the Iceberg side to do the work (see the sketch below).
2. Some customers are now using `s3a://` URIs in order to use `HadoopFileIO`, so we might need to accept this form as well to make migration to `S3FileIO` easy. Currently `AmazonS3URI` throws an exception for this case; implementing our own URI parser would be an easy way to support it.
3. After all, `S3FileIO` will only write to S3. If a user chooses this FileIO, it might even make sense to accept a plain file path instead of a URI, for example `/my-bucket/my/file/path`, to write to an S3 location. `HadoopFileIO` already has this behavior: a plain path is written to the default file system specified in the Hadoop config.
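To make the idea concrete, here is a minimal sketch of what such a parser could look like. The class name `S3URI`, its `bucket()`/`key()` accessors, and the set of accepted schemes are assumptions for illustration, not an existing Iceberg API; the class simply splits an `s3://`, `s3a://`, or `s3n://` URI into a bucket and a key, which is all `S3FileIO` needs:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class S3URI {
  // Schemes accepted by this sketch; an assumption, adjust as needed.
  private static final Set<String> VALID_SCHEMES =
      new HashSet<>(Arrays.asList("s3", "s3a", "s3n"));

  private final String location;
  private final String bucket;
  private final String key;

  public S3URI(String location) {
    this.location = location;

    // Split "s3://my-bucket/my/file/path" into scheme and remainder.
    String[] schemeSplit = location.split("://", 2);
    if (schemeSplit.length != 2 || !VALID_SCHEMES.contains(schemeSplit[0].toLowerCase())) {
      throw new IllegalArgumentException("Invalid S3 URI: " + location);
    }

    // Split "my-bucket/my/file/path" into bucket and key.
    String[] authoritySplit = schemeSplit[1].split("/", 2);
    if (authoritySplit.length != 2 || authoritySplit[1].isEmpty()) {
      throw new IllegalArgumentException("Invalid S3 URI, missing key: " + location);
    }

    this.bucket = authoritySplit[0];
    this.key = authoritySplit[1];
  }

  public String bucket() {
    return bucket;
  }

  public String key() {
    return key;
  }

  public String location() {
    return location;
  }
}
```

Supporting bare paths like `/my-bucket/my/file/path` (point 3 above) would mostly be a matter of relaxing the scheme check in this sketch.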
