Echoing Laurent and Igor, I wonder what the consequences of adding the 'gs://' scheme to S3FileIO would be if that scheme is already used by the Hadoop GCS connector. Do we want to overload that scheme? I would almost think it should be an s3:// scheme or something similar, right?

Best,
Ryan
On Fri, Dec 3, 2021 at 9:26 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Jack, https://github.com/apache/iceberg/pull/3656 is enough for my use case (because we are creating our own S3Client).

Thanks,
Mayur

On Thu, Dec 2, 2021 at 8:12 PM Igor Dvorzhak <i...@google.com.INVALID> wrote:

As long as the proposed changes do not prevent Iceberg from using the GCS connector (https://github.com/GoogleCloudDataproc/hadoop-connectors) via HCFS/HadoopFileIO to access GCS, I think it is OK to allow users to use S3FileIO with GCS.

On Thu, Dec 2, 2021 at 3:15 PM Laurent Goujon <laur...@dremio.com> wrote:

What about credentials? Sure, GCS has an S3 compatibility mode, but the gs:// URI used by Hadoop is native GCS support with Google authentication mechanisms (the GCS Hadoop filesystem is actually out of tree: https://github.com/GoogleCloudDataproc/hadoop-connectors).

Laurent

On Thu, Dec 2, 2021 at 3:05 PM Jack Ye <yezhao...@gmail.com> wrote:

Also https://github.com/apache/iceberg/pull/3658.

Please let me know if these are enough; we can discuss in the PRs. It would also be great if users of systems like MinIO could confirm.

-Jack

On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Looks like Jack is already on top of the problem (https://github.com/apache/iceberg/pull/3656). Thanks Jack!

On Thu, Dec 2, 2021 at 4:16 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

There are three reasons why we want to use S3FileIO over HadoopFileIO:

1. We want access to the S3Client in our service to support some special handling of the auth. This is not possible with HadoopFileIO because the S3Client is not exposed.
2. We would like to improve upon S3FileIO in the future by introducing a vectorized IO mechanism, and it makes it easier if we are already using S3FileIO. I'll post my thoughts about vectorized IO in a later email in the upcoming weeks.
3. As Ryan mentioned earlier, we are seeing very high memory usage with HadoopFileIO in the case of highly concurrent commits. I reported that in another thread.

Moving forward:

Can we start by adding 'gs' to S3URI's valid prefixes?

One of Jack's suggestions was to remove the scheme check from S3URI entirely. Given that we are building ResolvingFileIO, I think removing the scheme check from the individual implementations is not a bad idea.

Either solution will work for us.

Thanks,
Mayur
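For concreteness, the first option Mayur proposes (adding "gs" to S3URI's accepted schemes) might look roughly like the sketch below. This is illustrative only: the parsing logic and the exact scheme set are simplified assumptions, not the actual Iceberg source.

```java
import java.util.Set;

// Simplified stand-in for org.apache.iceberg.aws.s3.S3URI.
public class S3URI {
  // "gs" added alongside the schemes S3URI already accepts (assumed list).
  private static final Set<String> VALID_SCHEMES =
      Set.of("https", "s3", "s3a", "s3n", "gs");

  private final String scheme;
  private final String bucket;
  private final String key;

  public S3URI(String location) {
    String[] schemeSplit = location.split("://", 2);
    if (schemeSplit.length != 2
        || !VALID_SCHEMES.contains(schemeSplit[0].toLowerCase())) {
      throw new IllegalArgumentException("Invalid S3 URI: " + location);
    }
    this.scheme = schemeSplit[0];
    String[] pathSplit = schemeSplit[1].split("/", 2);
    this.bucket = pathSplit[0];
    this.key = pathSplit.length > 1 ? pathSplit[1] : "";
  }
}
```

With a change along these lines, new S3URI("gs://bucket/warehouse/table/metadata.json") would parse instead of throwing.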
On Thu, Dec 2, 2021 at 11:37 AM Ryan Blue <b...@tabular.io> wrote:

I think the advantage of S3FileIO over HadoopFileIO with s3a is that it doesn't hit the memory consumption problem that Mayur posted to the list. That's a fairly big advantage, so I think it's reasonable to try to support this in 0.13.0.

It should be easy enough to add the gs scheme, and then we can figure out how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me, so I guess we'll be adding scheme-to-implementation customization sooner than I thought!

Ryan

On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

Hi,

I agree that endpoint, credentials, path-style access, etc. should be configurable. There are storages that are primarily used as "S3 compatible", and they need these settings to work. We've seen them used to access MinIO, Ceph, and even S3 behind some gateway (I am light on details, sorry). In all these cases, users seem to use s3:// URLs even when not talking to the actual AWS S3 service.

If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO, just by accepting the gs:// protocol and delegating to S3FileIO for now. In the long term, though, I would recommend using a native GCS client, or the Hadoop file system implementation provided by Google.

BTW, Mayur, what is the advantage of using S3FileIO for Google storage vs HadoopFileIO?

BR,
PF

On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <yezhao...@gmail.com> wrote:

And here is a proposal of what I think could be the best way to go for both worlds:

(1) remove the URI restrictions in S3FileIO (or allow configuration of additional accepted schemes), and allow direct user configuration of endpoint, credentials, etc. to make S3 configuration simpler without the need to reconfigure the entire client.
(2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO, and everything else -> HadoopFileIO.
(3) for s3 and gs, ResolvingFileIO needs to develop the ability to initialize S3FileIO differently, and users should be able to configure them differently in catalog properties.
(4) for users that need GCS-unique features, a GCSFileIO could eventually be developed, and people could then choose to map gs -> GCSFileIO in ResolvingFileIO.

-Jack
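To illustrate point (2), the resolution step might look something like the sketch below. ResolvingFileIO does not expose this kind of configurable mapping today, so every name here is hypothetical.

```java
import java.util.Map;

// Hypothetical illustration of Jack's proposed scheme mapping; not Iceberg code.
public class SchemeResolution {
  // Proposed defaults: s3 and gs both resolve to S3FileIO.
  private static final Map<String, String> SCHEME_TO_FILE_IO = Map.of(
      "s3", "org.apache.iceberg.aws.s3.S3FileIO",
      "gs", "org.apache.iceberg.aws.s3.S3FileIO");

  // Anything not in the map falls back to HadoopFileIO.
  private static final String FALLBACK = "org.apache.iceberg.hadoop.HadoopFileIO";

  static String implFor(String location) {
    String scheme = location.split("://", 2)[0];
    return SCHEME_TO_FILE_IO.getOrDefault(scheme, FALLBACK);
  }

  public static void main(String[] args) {
    System.out.println(implFor("gs://bucket/db/table"));     // S3FileIO
    System.out.println(implFor("hdfs://namenode/db/table")); // HadoopFileIO
  }
}
```

Per point (3), the s3 and gs entries would additionally need separately configured S3FileIO instances (different endpoints and credentials), which is the part that does not exist yet.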
On Wed, Dec 1, 2021 at 4:14 PM Jack Ye <yezhao...@gmail.com> wrote:

Thanks for the confirmation, this is as I expected. We had a similar case for Dell EMC ECS recently, where they published a version of their FileIO that works through S3FileIO (https://github.com/apache/iceberg/pull/2807), and the only thing needed was to override the endpoint, region, and credentials. They also proposed some specialization because their object storage service is specialized with the Append operation when writing data. However, in the end they ended up just creating another FileIO (https://github.com/apache/iceberg/pull/3376) using their own SDK to better support the specialization.

I believe the recent addition of ResolvingFileIO was to support using multiple FileIOs and switching between them based on the file scheme. If we continue down that path, it feels more reasonable to me that we will have specialized FileIOs for each implementation and allow them to evolve independently. Users will be able to set whatever specialized configurations each implementation needs and take advantage of all of them.

On the other hand, if we can support using S3FileIO as the new standard FileIO that works with multiple storage providers, the advantages I see are:
(1) it is simple from the user's perspective, because the S3 protocol is the least common denominator supported by many cloud storage providers, and it's more work to configure and maintain multiple FileIOs.
(2) we can avoid the current per-path file scheme check in ResolvingFileIO, which might lead to some performance gain, although I do not know how much we would gain.

From a technical perspective, I prefer having dedicated FileIOs and an overall ResolvingFileIO, because Iceberg's FileIO interface is simple enough for people to build specialized and proper support for different storage systems. But it's also very tempting to just reuse the same thing instead of building another one, especially when that feature is lacking and the current functionality could easily be extended to support it. The concern is that we will end up like Hadoop, which had to develop another sub-layer of the FileSystem interface to accommodate the unique features of different storage providers once the specialized feature requests came, and at that point there is no difference from the dedicated FileIO + ResolvingFileIO architecture.

I wonder what Daniel thinks about this, since I believe he is more interested in multi-cloud support.

-Jack

On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi Jack, Daniel,

We use several S3-compatible backends with Iceberg; these include S3, GCS, and others. Currently, S3FileIO provides all the functionality we need for Iceberg to talk to these backends. We create S3FileIO via its constructor, passing the S3Client as a constructor parameter; we do not use the initialize(Map<String, String>) method in FileIO. Our custom catalog accepts the FileIO object at creation time. To talk to GCS, we create the S3Client with a few overrides (described below) and pass it to S3FileIO. After that, the rest of the S3FileIO code works as is. The only exception is that "gs" (used by GCS URIs) needs to be accepted as a valid S3 prefix. This is the reason I sent the email.

The reason we want to use S3FileIO to talk to GCS is that S3FileIO almost works out of the box and contains all the functionality needed to talk to GCS. The only special requirements are the creation of the S3Client and allowing the "gs" prefix in URIs. Based on our early experiments and benchmarks, S3FileIO provides all the functionality we need and performs well, so we didn't see a need to create a native GCS FileIO. The Iceberg operations we need (create, drop, read, and write objects) are all provided by S3FileIO.

We are managing ACLs (IAM in the case of GCS) at the bucket level, and that happens in our custom catalog. GCS has ACLs, but IAM is preferred. I've not experimented with ACLs or encryption with S3FileIO, and whether they work with GCS is a good question. But if these features are not enabled via default settings, S3FileIO works just fine with GCS.

I think there is a case for supporting S3-compatible backends in S3FileIO because a lot of the code is common. The question is whether we can cleanly expose the common S3FileIO code to work with these backends and separate out any specialization (if required), or whether we want a different FileIO implementation for each of the other S3-compatible backends such as GCS. I'm eager to hear more from the community about this. I'm happy to discuss and follow the long-term design direction of the Iceberg community.

The S3Client for GCS is created as follows (currently the code is not open source, so I'm sharing the steps only):
1. Create an S3ClientBuilder.
2. Set the GCS endpoint URI and region.
3. Set a credentials provider that returns null. You can set credentials here if you have static credentials.
4. Set a ClientOverrideConfiguration with interceptors in overrideConfiguration(). The interceptors are used to set up the authorization header in requests (setting projectId, auth tokens, etc.) and to do header translation for requests and responses.
5. Build the S3Client.
6. Pass the S3Client to S3FileIO.

Thanks,
Mayur
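Since Mayur's code is not open source, the following is only a guess at what those six steps look like with the AWS SDK v2 and Iceberg's S3FileIO supplier constructor. The token helper and interceptor body are placeholders, and AnonymousCredentialsProvider stands in for the "provider that returns null" he describes.

```java
import java.net.URI;
import org.apache.iceberg.aws.s3.S3FileIO;
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
import software.amazon.awssdk.http.SdkHttpRequest;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class GcsS3ClientExample {
  public static void main(String[] args) {
    // Steps 1-2: builder with the GCS XML API endpoint and a region.
    S3Client client = S3Client.builder()
        .endpointOverride(URI.create("https://storage.googleapis.com"))
        .region(Region.US_EAST_1) // placeholder; GCS ignores the AWS region
        // Step 3: no static AWS credentials; auth is added by the interceptor.
        .credentialsProvider(AnonymousCredentialsProvider.create())
        // Step 4: interceptor that injects the Google authorization header.
        .overrideConfiguration(ClientOverrideConfiguration.builder()
            .addExecutionInterceptor(new ExecutionInterceptor() {
              @Override
              public SdkHttpRequest modifyHttpRequest(
                  Context.ModifyHttpRequest ctx, ExecutionAttributes attrs) {
                return ctx.httpRequest().toBuilder()
                    .putHeader("Authorization", "Bearer " + fetchGoogleToken())
                    .build();
              }
            })
            .build())
        .build(); // Step 5

    // Step 6: hand the client to S3FileIO via its supplier constructor.
    S3FileIO io = new S3FileIO(() -> client);
  }

  // Hypothetical helper; in practice the token would come from Google's auth library.
  private static String fetchGoogleToken() {
    return "<oauth2-access-token>";
  }
}
```

The key design point is step 4: because the anonymous provider supplies no AWS credentials, the interceptor is solely responsible for attaching Google auth to each outgoing request.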
On Wed, Dec 1, 2021 at 1:16 PM Jack Ye <yezhao...@gmail.com> wrote:

Hi Mayur,

I know many object storage services allow communication using the Amazon S3 client by implementing the same protocol, like Dell EMC ECS and Aliyun OSS recently. But ultimately there are functionality differences that could be optimized with a native FileIO, and the two examples I listed before both contributed their own FileIO implementations to Iceberg recently. I would imagine some native S3 features like ACLs or SSE would not work for GCS, and some GCS features would not be supported in S3FileIO, so I think a specific GCS FileIO would likely be better for GCS support in the long term.

Could you describe how you configure S3FileIO to talk to GCS? Do you need to override the S3 endpoint or have any other configurations?

And, since I am not an expert on GCS: do you see using S3FileIO for GCS as a feasible long-term solution? Are there any GCS-specific features that you might need that could not be provided through S3FileIO, and how widely used are those features?

Best,
Jack Ye

On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

S3FileIO does use the AWS S3 v2 client libraries, and while there appears to be some level of compatibility, it's not clear to me how far that currently extends (some AWS features like encryption, IAM, etc. may not have full support).

I think it's great that there may be a path to more native GCS FileIO support, but it might be a little early to rename the classes and expect that everything will work cleanly.

Thanks for pointing this out, Mayur. It's really an interesting development.

-Dan

On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

If S3FileIO is supposed to be used with other file systems, we should consider proper class renames. Just my 2c.

On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible with the AWS S3 SDKs, and if they are added to the list of supported prefixes, they work with S3FileIO.

Thanks,
Mayur

On Wed, Dec 1, 2021 at 10:58 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

Hi,

Just curious: S3URI seems AWS S3-specific. What would be the goal of using S3URI with Google Cloud Storage URLs? What problem are we solving?

PF
On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Sounds reasonable to me if they are compatible.

On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn't support the gs:// prefix (see https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41). Is there an existing JIRA for supporting this? Any objections to adding "gs" to the list of S3 prefixes?

Thanks,
Mayur

--
Ryan Blue
Tabular