Echoing Laurent and Igor, I wonder what the consequence of adding the
'gs://' scheme to S3FileIO would be if that scheme is already used by the
Hadoop GCS connector. Do we want to overload that scheme? I would almost
think it should be an s3:// scheme or so, right?

Best,
Ryan

On Fri, Dec 3, 2021 at 9:26 AM Mayur Srivastava <
mayur.srivast...@twosigma.com> wrote:

> Jack, https://github.com/apache/iceberg/pull/3656 is enough for my use
> case (because we are creating our own S3Client).
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Igor Dvorzhak <i...@google.com.INVALID>
> *Sent:* Thursday, December 2, 2021 8:12 PM
> *To:* dev@iceberg.apache.org
> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> As long as the proposed changes will not prevent Iceberg from using the GCS
> connector (https://github.com/GoogleCloudDataproc/hadoop-connectors)
> via HCFS/HadoopFileIO to access GCS, I think it is OK to allow users
> to use S3FileIO with GCS.
>
>
>
> On Thu, Dec 2, 2021 at 3:15 PM Laurent Goujon <laur...@dremio.com> wrote:
>
> What about credentials? Sure, GCS has an S3 compatibility mode, but the
> gs:// URI used by Hadoop implies native GCS support with Google
> authentication mechanisms (the GCS Hadoop filesystem is actually maintained
> out of tree: https://github.com/GoogleCloudDataproc/hadoop-connectors).
>
>
>
> Laurent
>
>
>
> On Thu, Dec 2, 2021 at 3:05 PM Jack Ye <yezhao...@gmail.com> wrote:
>
> Also https://github.com/apache/iceberg/pull/3658.
>
>
>
> Please let me know if these are enough; we can discuss in the PRs. It
> would also be great if users of systems like MinIO could confirm.
>
>
>
> -Jack
>
>
>
> On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
> Looks like Jack is already on top of the problem (
> https://github.com/apache/iceberg/pull/3656). Thanks Jack!
>
>
>
> *From:* Mayur Srivastava <mayur.srivast...@twosigma.com>
> *Sent:* Thursday, December 2, 2021 4:16 PM
> *To:* dev@iceberg.apache.org
> *Subject:* RE: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> There are three reasons why we want to use S3FileIO over HadoopFileIO:
>
> 1.      We want access to the S3Client in our service to support some
> special handling of auth. This is not possible with HadoopFileIO because
> the S3Client is not exposed.
>
> 2.      We would like to improve upon S3FileIO in the future by
> introducing a vectorized IO mechanism, and that is easier if we are
> already using S3FileIO. I’ll post my thoughts about vectorized IO in a
> later email in the upcoming weeks.
>
> 3.      As Ryan mentioned earlier, we are seeing very high memory usage
> with HadoopFileIO in the case of highly concurrent commits. I reported
> that in another thread.
>
>
>
> To move forward:
>
>
>
> Can we start by adding ‘gs’ to the S3URI’s valid prefixes?
>
>
>
> One of Jack’s suggestions was to remove any scheme check from S3URI.
> Given that we are building ResolvingFileIO, I think removing the scheme
> check in the individual implementations is not a bad idea.
>
>
>
> Either solution will work for us.
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Ryan Blue <b...@tabular.io>
> *Sent:* Thursday, December 2, 2021 11:37 AM
> *To:* Iceberg Dev List <dev@iceberg.apache.org>
> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> I think the advantage of S3FileIO over HadoopFileIO with s3a is that it
> doesn't hit the memory consumption problem that Mayur posted to the list.
> That's a fairly big advantage, so I think it's reasonable to try to
> support this in 0.13.0.
>
>
>
> It should be easy enough to add the gs scheme, and then we can figure out
> how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me,
> so I guess we'll be adding scheme-to-implementation customization sooner
> than I thought!
>
>
>
> Ryan
>
>
>
> On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
> Hi
>
>
>
> I agree that endpoint, credentials, path style access etc. should be
> configurable.
>
> There are storages that are primarily used as "S3 compatible", and they
> need these settings to work.
>
> We've seen these being used to access MinIO, Ceph, and even S3 through
> some gateway (I am light on details, sorry).
>
> In all these cases, users seem to use s3:// URLs even when not talking to
> the actual AWS S3 service.
>
>
>
> If this is sufficient for GCS, we could create a GCSFileIO, or GCSS3FileIO,
> just by accepting the gs:// protocol and delegating to S3FileIO for now.
>
> In the long term, though, I would recommend using a native GCS client or
> the Hadoop file system implementation provided by Google.
>
>
>
> BTW, Mayur what is the advantage of using S3FileIO for google storage
> vs HadoopFileIO?
>
>
>
> BR
>
> PF
>
>
>
>
>
>
>
>
>
> On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <yezhao...@gmail.com> wrote:
>
> And here is a proposal for what I think could be the best way to get the
> best of both worlds:
>
> (1) remove URI restrictions in S3FileIO (or allow configuration of
> additional accepted schemes), and allow direct user configuration of
> endpoint, credentials, etc. to make S3 configuration simpler without the
> need to reconfigure the entire client.
>
> (2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO,
> others -> HadoopFileIO
>
> (3) for s3 and gs, ResolvingFileIO needs to develop the ability to
> initialize S3FileIO differently, and users should be able to configure them
> differently in catalog properties
>
> (4) for users that need GCS-specific features, a GCSFileIO could
> eventually be developed, and then people can choose to map gs -> GCSFileIO
> in ResolvingFileIO
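
A minimal sketch of the scheme-to-implementation routing described in (2) and (3); class and method names here are illustrative, not the actual ResolvingFileIO internals:

```java
import java.util.Map;

// Hypothetical sketch of scheme -> FileIO implementation routing.
// Names are illustrative and do not reflect the real ResolvingFileIO code.
public class SchemeResolver {
    private static final String S3_IMPL = "org.apache.iceberg.aws.s3.S3FileIO";
    private static final String FALLBACK = "org.apache.iceberg.hadoop.HadoopFileIO";

    // Both s3 and gs route to S3FileIO; everything else falls back to Hadoop.
    private static final Map<String, String> SCHEME_TO_IMPL =
        Map.of("s3", S3_IMPL, "gs", S3_IMPL);

    public static String resolve(String location) {
        int idx = location.indexOf("://");
        String scheme = idx < 0 ? "" : location.substring(0, idx);
        return SCHEME_TO_IMPL.getOrDefault(scheme, FALLBACK);
    }

    public static void main(String[] args) {
        // gs:// locations resolve to the S3FileIO implementation class name
        System.out.println(resolve("gs://bucket/data/file.parquet"));
    }
}
```

Per (3), the remaining work would be letting each mapped S3FileIO instance be initialized with its own endpoint and credentials from catalog properties.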
>
>
>
> -Jack
>
>
>
>
>
> On Wed, Dec 1, 2021 at 4:14 PM Jack Ye <yezhao...@gmail.com> wrote:
>
> Thanks for the confirmation, this is as I expected. We had a similar case
> for Dell EMC ECS recently, where they published a version of their FileIO
> that works through S3FileIO (https://github.com/apache/iceberg/pull/2807),
> and the only thing needed was to override the endpoint, region and
> credentials. They also proposed some specialization because their object
> storage service supports an Append operation when writing data. In the
> end, however, they created another FileIO (
> https://github.com/apache/iceberg/pull/3376) using their own SDK to
> better support the specialization.
>
>
>
> I believe the recent addition of ResolvingFileIO was to support using
> multiple FileIOs and switching between them based on the file scheme. If we
> continue down that path, it feels more reasonable to me that we will have
> specialized FileIOs for each implementation and allow them to evolve
> independently. Users will be able to set whatever specialized
> configurations each implementation offers and take advantage of all of them.
>
>
>
> On the other hand, if we can support using S3FileIO as the new standard
> FileIO that works with multiple storage providers, the advantages I see
> are:
>
> (1) it is simple from the user's perspective, because the least common
> denominator supported by many cloud storage service providers is the S3
> protocol, and it is more work to configure and maintain multiple FileIOs.
>
> (2) we can avoid the current per-path file scheme check in ResolvingFileIO,
> which might lead to some performance gain, although I do not know how much
> we would gain in practice.
>
>
>
> From a technical perspective, I prefer having dedicated FileIOs and an
> overall ResolvingFileIO, because Iceberg's FileIO interface is simple
> enough for people to build specialized and proper support for different
> storage systems. But it's also very tempting to just reuse the same thing
> instead of building another one, especially when the needed feature is
> lacking and the current functionality could be easily extended to support
> it. The concern is that we would end up like Hadoop, which had to develop
> another sub-layer of the FileSystem interface to accommodate the unique
> features of different storage providers once specialized feature requests
> arrived, and at that point there is no difference from the dedicated
> FileIO + ResolvingFileIO architecture.
>
>
>
> I wonder what Daniel thinks about this, since I believe he is more
> interested in multi-cloud support.
>
>
>
> -Jack
>
>
>
> On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
> Hi Jack, Daniel,
>
>
>
> We use several S3-compatible backends with Iceberg; these include S3, GCS,
> and others. Currently, S3FileIO provides all the functionality we need for
> Iceberg to talk to these backends. The way we create S3FileIO is via the
> constructor, providing the S3Client as a constructor param; we do not
> use the initialize(Map<String,String>) method in FileIO. Our custom catalog
> accepts the FileIO object at creation time. To talk to GCS, we create the
> S3Client with a few overrides (described below) and pass it to S3FileIO.
> After that, the rest of the S3FileIO code works as is. The only exception
> is that “gs” (used by GCS URIs) needs to be accepted as a valid S3 prefix.
> This is the reason I sent the email.
>
>
>
> The reason we want to use S3FileIO to talk to GCS is that S3FileIO
> almost works out of the box and contains all the functionality needed to
> talk to GCS. The only special requirements are the creation of the S3Client
> and allowing the “gs” prefix in URIs. Based on our early experiments and
> benchmarks, S3FileIO provides all the functionality we need and performs
> well, so we didn’t see a need to create a native GCS FileIO. The Iceberg
> operations we need are creating, dropping, reading, and writing objects on
> S3, and S3FileIO provides this functionality.
>
>
>
> We are managing ACLs (IAM in the case of GCS) at the bucket level, and that
> happens in our custom catalog. GCS has ACLs, but IAM is preferred. I’ve
> not experimented with ACLs or encryption with S3FileIO, and it is a good
> question whether they work with GCS. But if these features are not enabled
> via default settings, S3FileIO works just fine with GCS.
>
>
>
> I think there is a case for supporting S3-compatible backends in S3FileIO,
> because a lot of the code is common. The question is whether we can cleanly
> expose the common S3FileIO code to work with these backends and separate
> out any specialization (if required), OR whether we want a different FileIO
> implementation for each of the other S3-compatible backends such as GCS.
> I’m eager to hear more from the community about this, and I’m happy to
> discuss and follow the long-term design direction of the Iceberg community.
>
>
>
> The S3Client for GCS is created as follows (currently the code is not open
> source so I’m sharing the steps only):
>
> 1. Create S3ClientBuilder.
>
> 2. Set GCS endpoint URI and region.
>
> 3. Set a credentials provider that returns null. You can set credentials
> here if you have static credentials.
>
> 4. Set a ClientOverrideConfiguration with interceptors via
> overrideConfiguration(). The interceptors are used to set the authorization
> header on requests (setting projectId, auth tokens, etc.) and to do header
> translation for requests and responses.
>
> 5. Build the S3Client.
>
> 6. Pass the S3Client to S3FileIO.
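
In code, the steps above might look roughly like the following sketch. This is only an illustration assuming the AWS SDK v2 and Iceberg's S3FileIO on the classpath; GcsAuthInterceptor is a hypothetical stand-in for the auth/header-translation interceptors described above (the real code is not open source):

```java
// Sketch only; GcsAuthInterceptor is a hypothetical ExecutionInterceptor.
S3Client client = S3Client.builder()                                 // step 1
    .endpointOverride(URI.create("https://storage.googleapis.com"))  // step 2
    .region(Region.US_EAST_1)                                        // step 2
    .credentialsProvider(AnonymousCredentialsProvider.create())      // step 3
    .overrideConfiguration(ClientOverrideConfiguration.builder()     // step 4
        .addExecutionInterceptor(new GcsAuthInterceptor())
        .build())
    .build();                                                        // step 5

FileIO io = new S3FileIO(() -> client);                              // step 6
```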
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Jack Ye <yezhao...@gmail.com>
> *Sent:* Wednesday, December 1, 2021 1:16 PM
> *To:* Iceberg Dev List <dev@iceberg.apache.org>
> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> Hi Mayur,
>
>
>
> I know many object storage services allow communication using the
> Amazon S3 client by implementing the same protocol, like recently Dell
> EMC ECS and Aliyun OSS. But ultimately there are functionality differences
> that could be optimized with a native FileIO, and the two examples I listed
> both contributed their own FileIO implementations to Iceberg
> recently. I would imagine some native S3 features like ACL or SSE would not
> work for GCS, and some GCS features would not be supported in S3FileIO, so
> I think a specific GCS FileIO would likely be better for GCS support in the
> long term.
>
>
>
> Could you describe how you configure S3FileIO to talk to GCS? Do you need
> to override the S3 endpoint or have any other configurations?
>
>
>
> And I am not an expert on GCS: do you see using S3FileIO for GCS as a
> feasible long-term solution? Are there any GCS-specific features that you
> might need that could not be supported through S3FileIO, and how widely
> used are those features?
>
>
>
> Best,
>
> Jack Ye
>
>
>
>
>
>
>
> On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks <daniel.c.we...@gmail.com>
> wrote:
>
> The S3FileIO does use the AWS S3 v2 client libraries, and while there
> appears to be some level of compatibility, it's not clear to me how far
> that currently extends (some AWS features like encryption, IAM, etc. may
> not have full support).
>
>
>
> I think it's great that there may be a path for more native GCS FileIO
> support, but it might be a little early to rename the classes and expect
> that everything will work cleanly.
>
>
>
> Thanks for pointing this out, Mayur.  It's really an interesting
> development.
>
>
>
> -Dan
>
>
>
> On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
> If S3FileIO is supposed to be used with other file systems, we should
> consider proper class renames.
>
> just my 2c
>
>
>
> On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
> Hi,
>
>
>
> We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible
> with the AWS S3 SDKs, and if “gs” is added to the list of supported
> prefixes, they work with S3FileIO.
>
>
>
> Thanks,
>
> Mayur
>
>
>
> *From:* Piotr Findeisen <pi...@starburstdata.com>
> *Sent:* Wednesday, December 1, 2021 10:58 AM
> *To:* Iceberg Dev List <dev@iceberg.apache.org>
> *Subject:* Re: Supporting gs:// prefix in S3URI for Google Cloud S3
> Storage
>
>
>
> Hi
>
>
>
> Just curious: S3URI seems AWS S3-specific. What would be the goal of using
> S3URI with Google Cloud Storage URLs?
>
> What problem are we solving?
>
>
>
> PF
>
>
>
>
>
> On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
> Sounds reasonable to me if they are compatible
>
>
>
> On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <
> mayur.srivast...@twosigma.com> wrote:
>
> Hi,
>
>
>
> We have URIs starting with gs:// representing objects on GCS. Currently,
> S3URI doesn’t support the gs:// prefix (see
> https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41).
> Is there an existing JIRA for supporting this? Any objections to adding
> “gs” to the list of S3 prefixes?
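
The check in question is essentially a set membership test on the URI scheme. A simplified, self-contained sketch (field and method names are illustrative, not the actual S3URI code):

```java
import java.util.Set;

// Simplified sketch of an S3URI-style scheme check; names are illustrative.
// Adding "gs" to the valid set is the change being proposed.
public class SchemeCheck {
    private static final Set<String> VALID_SCHEMES =
        Set.of("https", "s3", "s3a", "s3n", "gs");  // "gs" is the addition

    public static String validate(String location) {
        String[] parts = location.split("://", 2);
        if (parts.length != 2 || !VALID_SCHEMES.contains(parts[0])) {
            throw new IllegalArgumentException("Invalid S3 URI: " + location);
        }
        return parts[0];
    }

    public static void main(String[] args) {
        System.out.println(validate("gs://my-bucket/path/data.parquet"));  // prints "gs"
    }
}
```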
>
>
>
> Thanks,
>
> Mayur
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Tabular
>
>
