What about credentials? Sure, GCS has an S3 compatibility mode, but the gs:// URI used by Hadoop is native GCS support with Google authentication mechanisms (the GCS Hadoop filesystem is actually out of tree: https://github.com/GoogleCloudDataproc/hadoop-connectors).
Laurent

On Thu, Dec 2, 2021 at 3:05 PM Jack Ye <yezhao...@gmail.com> wrote:

Also https://github.com/apache/iceberg/pull/3658.

Please let me know if these are enough; we can discuss in the PRs. It would also be great if users of systems like MinIO could confirm.

-Jack

On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Looks like Jack is already on top of the problem (https://github.com/apache/iceberg/pull/3656). Thanks Jack!

From: Mayur Srivastava <mayur.srivast...@twosigma.com>
Sent: Thursday, December 2, 2021 4:16 PM
To: dev@iceberg.apache.org
Subject: RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

There are three reasons why we want to use S3FileIO over HadoopFileIO:

1. We want access to the S3Client in our service to support some special handling of the auth. This is not possible with HadoopFileIO because the S3Client is not exposed.

2. We would like to improve upon S3FileIO in the future by introducing a vectorized IO mechanism, and that is easier if we are already using S3FileIO. I'll post my thoughts about vectorized IO in a later email in the upcoming weeks.

3. As Ryan mentioned earlier, we are seeing very high memory usage with HadoopFileIO in the case of highly concurrent commits. I reported that in another thread.

To move forward:

Can we start by adding "gs" to S3URI's valid prefixes?

One of Jack's suggestions was to remove any scheme check from S3URI. Given that we are building ResolvingFileIO, I think removing the scheme check in the individual implementation is not a bad idea.

Either solution will work for us.
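For readers following along, the first option amounts to a one-entry change in the scheme allow-list. Below is a rough, self-contained sketch of what such a check looks like; the class name, scheme set, and method names are illustrative, not the actual S3URI code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: mirrors the kind of prefix check S3URI performs,
// with "gs" added to the set of accepted schemes.
public class SchemeCheck {
    // Hypothetical allow-list; the real S3URI maintains its own set.
    private static final Set<String> VALID_SCHEMES =
        new HashSet<>(Arrays.asList("s3", "s3a", "s3n", "gs"));

    // Extract the scheme from a location such as "gs://bucket/key".
    public static String scheme(String location) {
        int idx = location.indexOf("://");
        if (idx < 0) {
            throw new IllegalArgumentException("Invalid URI, no scheme: " + location);
        }
        return location.substring(0, idx).toLowerCase();
    }

    public static boolean isValidScheme(String location) {
        return VALID_SCHEMES.contains(scheme(location));
    }
}
```

The second option (removing the check entirely) would simply delete the `VALID_SCHEMES` lookup and accept any scheme, leaving scheme-based routing to ResolvingFileIO.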
Thanks,
Mayur

From: Ryan Blue <b...@tabular.io>
Sent: Thursday, December 2, 2021 11:37 AM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

I think the advantage of S3FileIO over HadoopFileIO with s3a is that it doesn't hit the memory consumption problem that Mayur posted to the list. That's a fairly big advantage, so I think it's reasonable to try to support this in 0.13.0.

It should be easy enough to add the gs scheme, and then we can figure out how we want to handle ResolvingFileIO. Jack's plan seems reasonable to me, so I guess we'll be adding scheme-to-implementation customization sooner than I thought!

Ryan

On Thu, Dec 2, 2021 at 1:24 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

Hi

I agree that endpoint, credentials, path-style access, etc. should be configurable. There are storages which are primarily used as "S3 compatible", and they need these settings to work. We've seen these being used to access MinIO, Ceph, and even S3 behind some gateway (I am light on details, sorry). In all these cases, users seem to use s3:// URLs even when not talking to the actual AWS S3 service.

If this is sufficient for GCS, we could create GCSFileIO, or GCSS3FileIO, just by accepting the gs:// protocol and delegating to S3FileIO for now. In the long term, I would recommend using the native GCS client, or the Hadoop file system implementation provided by Google.

BTW, Mayur, what is the advantage of using S3FileIO for Google storage vs HadoopFileIO?
BR,
PF

On Thu, Dec 2, 2021 at 1:30 AM Jack Ye <yezhao...@gmail.com> wrote:

And here is a proposal of what I think could be the best way to go for both worlds:

(1) remove URI restrictions in S3FileIO (or allow configuration of additional accepted schemes), and allow direct user configuration of endpoint, credentials, etc. to make S3 configuration simpler without the need to reconfigure the entire client.

(2) configure ResolvingFileIO to map s3 -> S3FileIO, gs -> S3FileIO, others -> HadoopFileIO.

(3) for s3 and gs, ResolvingFileIO needs to develop the ability to initialize S3FileIO differently, and users should be able to configure them differently in catalog properties.

(4) for users that need GCS-specific features, a GCSFileIO could eventually be developed, and then people can choose to map gs -> GCSFileIO in ResolvingFileIO.

-Jack

On Wed, Dec 1, 2021 at 4:14 PM Jack Ye <yezhao...@gmail.com> wrote:

Thanks for the confirmation, this is as I expected. We had a similar case for Dell EMC ECS recently, where they published a version of their FileIO that works through S3FileIO (https://github.com/apache/iceberg/pull/2807), and the only thing needed was to override the endpoint, region, and credentials. They also proposed some specialization because their object storage service is specialized with the Append operation when writing data. However, in the end they ended up creating another FileIO (https://github.com/apache/iceberg/pull/3376) using their own SDK to better support the specialization.

I believe the recent addition of ResolvingFileIO was to support using multiple FileIOs and switching between them based on the file scheme.
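The scheme mapping in points (2) and (3) above can be sketched as a small scheme-to-implementation table. This is a toy illustration, not ResolvingFileIO's actual API or configuration keys; only the two FileIO class names are real Iceberg classes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed scheme -> FileIO mapping. The real ResolvingFileIO
// has its own resolution logic and per-implementation initialization.
public class SchemeResolverSketch {
    private final Map<String, String> schemeToImpl = new HashMap<>();
    private final String fallbackImpl;

    public SchemeResolverSketch() {
        // s3 and gs both route to S3FileIO; everything else falls back.
        schemeToImpl.put("s3", "org.apache.iceberg.aws.s3.S3FileIO");
        schemeToImpl.put("gs", "org.apache.iceberg.aws.s3.S3FileIO");
        this.fallbackImpl = "org.apache.iceberg.hadoop.HadoopFileIO";
    }

    // Pick the FileIO implementation class for a given file location.
    public String implFor(String location) {
        int idx = location.indexOf("://");
        String scheme = idx < 0 ? "" : location.substring(0, idx).toLowerCase();
        return schemeToImpl.getOrDefault(scheme, fallbackImpl);
    }
}
```

Point (3) is the harder part: because s3 and gs map to the *same* class, the resolver would also need per-scheme initialization properties (endpoint, credentials) rather than a single shared S3FileIO instance.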
If we continue on that path, it feels more reasonable to me that we will have specialized FileIOs for each implementation and allow them to evolve independently. Users will be able to set whatever specialized configurations each implementation needs and take advantage of all of them.

On the other hand, if we can support using S3FileIO as the new standard FileIO that works with multiple storage providers, the advantages I see are:

(1) it is simple from the user's perspective, because the least common denominator of the storage APIs offered by many cloud storage service providers is S3. It's more work to configure and maintain multiple FileIOs.

(2) we can avoid the current check in ResolvingFileIO of the file scheme for each file path string, which might lead to some performance gain, although I do not know how much we gain in this process.

From a technical perspective I prefer having dedicated FileIOs and an overall ResolvingFileIO, because Iceberg's FileIO interface is simple enough for people to build specialized and proper support for different storage systems. But it's also very tempting to just reuse the same thing instead of building another one, especially when that feature is lacking and the current functionality could easily be extended to support it. The concern is that we will end up like Hadoop, which had to develop another sub-layer of the FileSystem interface to accommodate the unique features of different storage providers once the specialized feature requests came, and at that point there is no difference from the dedicated FileIO + ResolvingFileIO architecture.

I wonder what Daniel thinks about this, since I believe he is more interested in multi-cloud support.
-Jack

On Wed, Dec 1, 2021 at 3:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi Jack, Daniel,

We use several S3-compatible backends with Iceberg; these include S3, GCS, and others. Currently, S3FileIO provides all the functionality we need for Iceberg to talk to these backends. The way we create S3FileIO is via the constructor, providing the S3Client as a constructor param; we do not use the initialize(Map<String, String>) method in FileIO. Our custom catalog accepts the FileIO object at creation time. To talk to GCS, we create the S3Client with a few overrides (described below) and pass it to S3FileIO. After that, the rest of the S3FileIO code works as is. The only exception is that "gs" (used by GCS URIs) needs to be accepted as a valid S3 prefix. This is the reason I sent the email.

The reason we want to use S3FileIO to talk to GCS is that S3FileIO almost works out of the box and contains all the functionality needed to talk to GCS. The only special requirements are the creation of the S3Client and allowing the "gs" prefix in the URIs. Based on our early experiments and benchmarks, S3FileIO provides all the functionality we need and performs well, so we didn't see a need to create a native GCS FileIO. The Iceberg operations we need are creating, dropping, reading, and writing objects on S3, and S3FileIO provides this functionality.

We are managing ACLs (IAM in the case of GCS) at the bucket level, and that happens in our custom catalog. GCS has ACLs, but IAM is preferred. I've not experimented with ACLs or encryption with S3FileIO, and it is a good question whether they work with GCS. But if these features are not enabled via default settings, S3FileIO works just fine with GCS.

I think there is a case for supporting S3-compatible backends in S3FileIO because a lot of the code is common.
The question is whether we can cleanly expose the common S3FileIO code to work with these backends and separate out any specialization (if required), OR whether we want a different FileIO implementation for each of the other S3-compatible backends such as GCS. I'm eager to hear more from the community about this. I'm happy to discuss and follow the long-term design direction of the Iceberg community.

The S3Client for GCS is created as follows (currently the code is not open source, so I'm sharing the steps only):

1. Create the S3ClientBuilder.
2. Set the GCS endpoint URI and region.
3. Set a credentials provider that returns null. You can set credentials here if you have static credentials.
4. Set a ClientOverrideConfiguration with interceptors via overrideConfiguration(). The interceptors are used to set up the authorization header in requests (setting projectId, auth tokens, etc.) and do header translation for requests and responses.
5. Build the S3Client.
6. Pass the S3Client to S3FileIO.

Thanks,
Mayur

From: Jack Ye <yezhao...@gmail.com>
Sent: Wednesday, December 1, 2021 1:16 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

Hi Mayur,

I know many object storage services allow communication using the Amazon S3 client by implementing the same protocol, like recently Dell EMC ECS and Aliyun OSS. But ultimately there are functionality differences that could be optimized with a native FileIO, and the two examples I listed before both contributed their own FileIO implementations to Iceberg recently. I would imagine some native S3 features like ACL or SSE would not work for GCS, and some GCS features would not be supported in S3FileIO, so I think a specific GCS FileIO would likely be better for GCS support in the long term.
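The six steps above can be sketched roughly as follows with the AWS SDK for Java v2 and iceberg-aws on the classpath. This is a guess at the shape of the code, not the actual Two Sigma implementation: the endpoint, region, token handling, and interceptor body are all illustrative, and AnonymousCredentialsProvider stands in for the "credentials provider that returns null" in step 3.

```java
import java.net.URI;
import org.apache.iceberg.aws.s3.S3FileIO;
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;
import software.amazon.awssdk.http.SdkHttpRequest;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class GcsBackedS3FileIO {

  public static S3FileIO create(String bearerToken) {
    // Step 4: an interceptor that injects the Google auth token; a real
    // implementation would also translate request/response headers.
    ExecutionInterceptor authInterceptor = new ExecutionInterceptor() {
      @Override
      public SdkHttpRequest modifyHttpRequest(
          Context.ModifyHttpRequest ctx, ExecutionAttributes attrs) {
        return ctx.httpRequest().toBuilder()
            .putHeader("Authorization", "Bearer " + bearerToken)
            .build();
      }
    };

    S3Client s3 = S3Client.builder()                                    // step 1
        .endpointOverride(URI.create("https://storage.googleapis.com")) // step 2
        .region(Region.US_EAST_1)                                       // step 2 (placeholder)
        .credentialsProvider(AnonymousCredentialsProvider.create())     // step 3
        .overrideConfiguration(ClientOverrideConfiguration.builder()    // step 4
            .addExecutionInterceptor(authInterceptor)
            .build())
        .build();                                                       // step 5

    return new S3FileIO(() -> s3);                                      // step 6
  }
}
```

Step 6 uses the S3FileIO constructor that takes a supplier of the client, which is what lets a custom catalog hand in a fully preconfigured S3Client.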
Could you describe how you configure S3FileIO to talk to GCS? Do you need to override the S3 endpoint or have any other configurations?

And, as I am not an expert on GCS: do you see using S3FileIO for GCS as a feasible long-term solution? Are there any GCS-specific features that you might need that could not be handled through S3FileIO, and how widely used are those features?

Best,
Jack Ye

On Wed, Dec 1, 2021 at 8:50 AM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

S3FileIO does use the AWS S3 V2 client libraries, and while there appears to be some level of compatibility, it's not clear to me how far that currently extends (some AWS features like encryption, IAM, etc. may not have full support).

I think it's great that there may be a path to more native GCS FileIO support, but it might be a little early to rename the classes and expect that everything will work cleanly.

Thanks for pointing this out, Mayur. It's a really interesting development.

-Dan

On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com> wrote:

If S3FileIO is supposed to be used with other file systems, we should consider proper class renames.

Just my 2c.

On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We are using S3FileIO to talk to the GCS backend. GCS URIs are compatible with the AWS S3 SDKs, and if they are added to the list of supported prefixes, they work with S3FileIO.

Thanks,
Mayur

From: Piotr Findeisen <pi...@starburstdata.com>
Sent: Wednesday, December 1, 2021 10:58 AM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

Hi

Just curious: S3URI seems AWS S3-specific.
What would be the goal of using S3URI with Google Cloud Storage URLs? What problem are we solving?

PF

On Wed, Dec 1, 2021 at 4:56 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

Sounds reasonable to me if they are compatible.

On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote:

Hi,

We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn't support the gs:// prefix (see https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41). Is there an existing JIRA for supporting this? Any objections to adding "gs" to the list of S3 prefixes?

Thanks,
Mayur

--
Ryan Blue
Tabular