What about credentials? Sure, GCS has an S3 compatibility mode, but the gs:// URI used by Hadoop is native GCS support with Google authentication mechanisms (the GCS Hadoop filesystem is actually out of tree: https://github.com/GoogleCloudDataproc/hadoop-connectors).
Laurent

On Thu, Dec 2, 2021 at 3:05 PM Jack Ye <yezhao...@gmail.com> wrote:

Also https://github.com/apache/iceberg/pull/3658.

Please let me know if these are enough; we can discuss in the PRs. It would also be great if users of systems like MinIO could confirm.

-Jack

On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Looks like Jack is already on top of the problem (https://github.com/apache/iceberg/pull/3656). Thanks Jack!

From: Mayur Srivastava <mayur.srivast...@twosigma.com>
Sent: Thursday, December 2, 2021 4:16 PM
To: dev@iceberg.apache.org
Subject: RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

There are three reasons why we want to use S3FileIO over HadoopFileIO:

1. We want access to the S3Client in our service to support some special handling of the auth. This is not possible with HadoopFileIO because the S3Client is not exposed.

2. We would like to improve upon S3FileIO in the future by introducing a vectorized IO mechanism, and that is easier if we are already using S3FileIO. I'll post my thoughts about vectorized IO in a later email in the upcoming weeks.

3. As Ryan mentioned earlier, we are seeing very high memory usage with HadoopFileIO in the case of highly concurrent commits. I reported that in another thread.

To move forward:

Can we start by adding "gs" to S3URI's valid prefixes?

One of Jack's suggestions was to remove any scheme check from S3URI. Given that we are building ResolvingFileIO, I think removing the scheme check in the individual implementation is not a bad idea.

Either solution will work for us.
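For readers following along, the first option amounts to a one-entry change in the scheme allow-list. Below is a rough, self-contained sketch of what such a check looks like; the class name, scheme set, and method names are illustrative, not the actual S3URI code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: mirrors the kind of prefix check S3URI performs,
// with "gs" added to the set of accepted schemes.
public class SchemeCheck {
    // Hypothetical allow-list; the real S3URI maintains its own set.
    private static final Set<String> VALID_SCHEMES =
        new HashSet<>(Arrays.asList("s3", "s3a", "s3n", "gs"));

    // Extract the scheme from a location such as "gs://bucket/key".
    public static String scheme(String location) {
        int idx = location.indexOf("://");
        if (idx < 0) {
            throw new IllegalArgumentException("Invalid URI, no scheme: " + location);
        }
        return location.substring(0, idx).toLowerCase();
    }

    public static boolean isValidScheme(String location) {
        return VALID_SCHEMES.contains(scheme(location));
    }
}
```

The second option (removing the check entirely) would simply delete the `VALID_SCHEMES` lookup and accept any scheme, leaving scheme-based routing to ResolvingFileIO.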
Thanks,
Mayur

From: Ryan Blue <b...@tabular.io>
Sent: Thursday, December 2, 2021 11:37 AM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

I think the advantage of S3FileIO over HadoopFileIO with s3a is that it doesn't hit the memory consumption problem that Mayur posted to the list. That's a fairly big advantage, so I think it's reasonable to try to support this in 0.13.0.

It should be easy enough to add the gs scheme, and then we can figure out how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me, so I guess we'll be adding scheme-to-implementation customization sooner than I thought!

Ryan

On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

Hi

I agree that endpoint, credentials, path-style access, etc. should be configurable. There are storages which are primarily used as "S3 compatible", and they need these settings to work. We've seen these being used to access MinIO, Ceph, and even S3 behind some gateway (I am light on details, sorry). In all these cases, users seem to use s3:// URLs even when not talking to the actual AWS S3 service.

If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO, just by accepting the gs:// protocol and delegating to S3FileIO for now. In the long term, I would recommend using the native GCS client, or the Hadoop file system implementation provided by Google.

BTW, Mayur, what is the advantage of using S3FileIO for Google storage vs HadoopFileIO?
BR,
PF

On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <yezhao...@gmail.com> wrote:

And here is a proposal of what I think could be the best way to go for both worlds:

(1) remove URI restrictions in S3FileIO (or allow configuration of additional accepted schemes), and allow direct user configuration of endpoint, credentials, etc. to make S3 configuration simpler without the need to reconfigure the entire client.

(2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO, others -> HadoopFileIO.

(3) for s3 and gs, ResolvingFileIO needs to develop the ability to initialize S3FileIO differently, and users should be able to configure them differently in catalog properties.

(4) for users that need GCS-specific features, a GCSFileIO could eventually be developed, and then people can choose to map gs -> GCSFileIO in ResolvingFileIO.

-Jack

On Wed, Dec 1, 2021 at 4:14 PM Jack Ye <yezhao...@gmail.com> wrote:

Thanks for the confirmation, this is as I expected. We had a similar case for Dell EMC ECS recently, where they published a version of their FileIO that works through S3FileIO (https://github.com/apache/iceberg/pull/2807), and the only thing needed was to override the endpoint, region, and credentials. They also proposed some specialization because their object storage service is specialized with the Append operation when writing data. However, in the end they ended up creating another FileIO (https://github.com/apache/iceberg/pull/3376) using their own SDK to better support the specialization.

I believe the recent addition of ResolvingFileIO was to support using multiple FileIOs and switching between them based on the file scheme.
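The scheme mapping in points (2) and (3) above can be sketched as a small scheme-to-implementation table. This is a toy illustration, not ResolvingFileIO's actual API or configuration keys; only the two FileIO class names are real Iceberg classes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed scheme -> FileIO mapping. The real ResolvingFileIO
// has its own resolution logic and per-implementation initialization.
public class SchemeResolverSketch {
    private final Map<String, String> schemeToImpl = new HashMap<>();
    private final String fallbackImpl;

    public SchemeResolverSketch() {
        // s3 and gs both route to S3FileIO; everything else falls back.
        schemeToImpl.put("s3", "org.apache.iceberg.aws.s3.S3FileIO");
        schemeToImpl.put("gs", "org.apache.iceberg.aws.s3.S3FileIO");
        this.fallbackImpl = "org.apache.iceberg.hadoop.HadoopFileIO";
    }

    // Pick the FileIO implementation class for a given file location.
    public String implFor(String location) {
        int idx = location.indexOf("://");
        String scheme = idx < 0 ? "" : location.substring(0, idx).toLowerCase();
        return schemeToImpl.getOrDefault(scheme, fallbackImpl);
    }
}
```

Point (3) is the harder part: because s3 and gs map to the *same* class, the resolver would also need per-scheme initialization properties (endpoint, credentials) rather than a single shared S3FileIO instance.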
If we continue on that path, it feels more reasonable to me that we will have specialized FileIOs for each implementation and allow them to evolve independently. Users will be able to set whatever specialized configurations each implementation needs and take advantage of all of them.

On the other hand, if we can support using S3FileIO as the new standard FileIO that works with multiple storage providers, the advantages I see are:

(1) it is simple from the user's perspective, because the least common denominator of the storage APIs offered by many cloud storage service providers is S3. It's more work to configure and maintain multiple FileIOs.

(2) we can avoid the current check in ResolvingFileIO of the file scheme for each file path string, which might lead to some performance gain, although I do not know how much we gain in this process.

From a technical perspective I prefer having dedicated FileIOs and an overall ResolvingFileIO, because Iceberg's FileIO interface is simple enough for people to build specialized and proper support for different storage systems. But it's also very tempting to just reuse the same thing instead of building another one, especially when that feature is lacking and the current functionality could easily be extended to support it. The concern is that we will end up like Hadoop, which had to develop another sub-layer of the FileSystem interface to accommodate the unique features of different storage providers once the specialized feature requests came, and at that point there is no difference from the dedicated FileIO + ResolvingFileIO architecture.

I wonder what Daniel thinks about this, since I believe he is more interested in multi-cloud support.
-Jack

On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi Jack, Daniel,

We use several S3-compatible backends with Iceberg; these include S3, GCS, and others. Currently, S3FileIO provides all the functionality we need for Iceberg to talk to these backends. The way we create S3FileIO is via the constructor, providing the S3Client as a constructor param; we do not use the initialize(Map<String, String>) method in FileIO. Our custom catalog accepts the FileIO object at creation time. To talk to GCS, we create the S3Client with a few overrides (described below) and pass it to S3FileIO. After that, the rest of the S3FileIO code works as is. The only exception is that "gs" (used by GCS URIs) needs to be accepted as a valid S3 prefix. This is the reason I sent the email.

The reason we want to use S3FileIO to talk to GCS is that S3FileIO almost works out of the box and contains all the functionality needed to talk to GCS. The only special requirements are the creation of the S3Client and allowing the "gs" prefix in the URIs. Based on our early experiments and benchmarks, S3FileIO provides all the functionality we need and performs well, so we didn't see a need to create a native GCS FileIO. The Iceberg operations we need are creating, dropping, reading, and writing objects on S3, and S3FileIO provides this functionality.

We are managing ACLs (IAM in the case of GCS) at the bucket level, and that happens in our custom catalog. GCS has ACLs, but IAM is preferred. I've not experimented with ACLs or encryption with S3FileIO, and it is a good question whether they work with GCS. But if these features are not enabled via default settings, S3FileIO works just fine with GCS.

I think there is a case for supporting S3-compatible backends in S3FileIO because a lot of the code is common.
The question is whether we can cleanly expose the common S3FileIO code to work with these backends and separate out any specialization (if required), OR whether we want a different FileIO implementation for each of the other S3-compatible backends such as GCS. I'm eager to hear more from the community about this. I'm happy to discuss and follow the long-term design direction of the Iceberg community.

The S3Client for GCS is created as follows (currently the code is not open source, so I'm sharing the steps only):

1. Create the S3ClientBuilder.
2. Set the GCS endpoint URI and region.
3. Set a credentials provider that returns null. You can set credentials here if you have static credentials.
4. Set a ClientOverrideConfiguration with interceptors via overrideConfiguration(). The interceptors are used to set up the authorization header in requests (setting projectId, auth tokens, etc.) and do header translation for requests and responses.
5. Build the S3Client.
6. Pass the S3Client to S3FileIO.

Thanks,
Mayur

From: Jack Ye <yezhao...@gmail.com>
Sent: Wednesday, December 1, 2021 1:16 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

Hi Mayur,

I know many object storage services allow communication using the Amazon S3 client by implementing the same protocol, like recently Dell EMC ECS and Aliyun OSS. But ultimately there are functionality differences that could be optimized with a native FileIO, and the two examples I listed before both contributed their own FileIO implementations to Iceberg recently. I would imagine some native S3 features like ACL or SSE would not work for GCS, and some GCS features would not be supported in S3FileIO, so I think a specific GCS FileIO would likely be better for GCS support in the long term.
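The six steps above can be sketched roughly as follows with the AWS SDK for Java v2 and iceberg-aws on the classpath. This is a guess at the shape of the code, not the actual Two Sigma implementation: the endpoint, region, token handling, and interceptor body are all illustrative, and AnonymousCredentialsProvider stands in for the "credentials provider that returns null" in step 3.

```java
import java.net.URI;
import org.apache.iceberg.aws.s3.S3FileIO;
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
import software.amazon.awssdk.http.SdkHttpRequest;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class GcsBackedS3FileIO {

  public static S3FileIO create(String bearerToken) {
    // Step 4: an interceptor that injects the Google auth token; a real
    // implementation would also translate request/response headers.
    ExecutionInterceptor authInterceptor = new ExecutionInterceptor() {
      @Override
      public SdkHttpRequest modifyHttpRequest(
          Context.ModifyHttpRequest ctx, ExecutionAttributes attrs) {
        return ctx.httpRequest().toBuilder()
            .putHeader("Authorization", "Bearer " + bearerToken)
            .build();
      }
    };

    S3Client s3 = S3Client.builder()                                    // step 1
        .endpointOverride(URI.create("https://storage.googleapis.com")) // step 2
        .region(Region.US_EAST_1)                                       // step 2 (placeholder)
        .credentialsProvider(AnonymousCredentialsProvider.create())     // step 3
        .overrideConfiguration(ClientOverrideConfiguration.builder()    // step 4
            .addExecutionInterceptor(authInterceptor)
            .build())
        .build();                                                       // step 5

    return new S3FileIO(() -> s3);                                      // step 6
  }
}
```

Step 6 uses the S3FileIO constructor that takes a supplier of the client, which is what lets a custom catalog hand in a fully preconfigured S3Client.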
Could you describe how you configure S3FileIO to talk to GCS? Do you need to override the S3 endpoint or have any other configurations?

And, as I am not an expert on GCS: do you see using S3FileIO for GCS as a feasible long-term solution? Are there any GCS-specific features that you might need that could not be handled through S3FileIO, and how widely used are those features?

Best,
Jack Ye

On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

S3FileIO does use the AWS S3 V2 client libraries, and while there appears to be some level of compatibility, it's not clear to me how far that currently extends (some AWS features like encryption, IAM, etc. may not have full support).

I think it's great that there may be a path to more native GCS FileIO support, but it might be a little early to rename the classes and expect that everything will work cleanly.

Thanks for pointing this out, Mayur. It's a really interesting development.

-Dan

On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

If S3FileIO is supposed to be used with other file systems, we should consider proper class renames.

Just my 2c.

On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible with the AWS S3 SDKs, and if they are added to the list of supported prefixes, they work with S3FileIO.

Thanks,
Mayur

From: Piotr Findeisen <pi...@starburstdata.com>
Sent: Wednesday, December 1, 2021 10:58 AM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

Hi

Just curious: S3URI seems AWS S3-specific.
What would be the goal of using S3URI with Google Cloud Storage URLs? What problem are we solving?

PF

On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Sounds reasonable to me if they are compatible.

On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn't support the gs:// prefix (see https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41). Is there an existing JIRA for supporting this? Any objections to adding "gs" to the list of S3 prefixes?

Thanks,
Mayur

--
Ryan Blue
Tabular