[ https://issues.apache.org/jira/browse/HADOOP-19343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937215#comment-17937215 ]
Chris Nauroth commented on HADOOP-19343:
----------------------------------------

We had a productive meeting about this on 2025-03-20, attended by me, [~arunchacko], [~mthakur], [~ste...@apache.org], with an exciting cameo from [~lmccay]! I took notes, and I'd like to summarize here for the whole community:

* Reiterating the high-level proposal, we want to bring over a port of the code from [the existing repository|https://github.com/GoogleCloudDataproc/hadoop-connectors/], with some simplifications made along the way. For example, we don't plan on retaining the existing Maven multi-module structure, because it provides no benefit in Apache Hadoop, and existing precedent is that each of the other cloud file systems is a single module.
* Acceptance criteria include all tests passing, including file system contract tests and execution of TPC benchmarks. We're initially targeting performance within 15% of the existing repo, with iterative improvements after that.
* Our end goal is 100% feature parity with the existing repo. To make the project more manageable, we want to structure this in milestones with initial "must have" features and additional features to be added incrementally.
* One specific example of a feature we won't target initially is [GCS hierarchical namespace|https://cloud.google.com/storage/docs/hns-overview] ("HNS") support. We'll be able to work with HNS buckets, but the initial port won't include optimizations implemented for HNS buckets. There was a side discussion about how this might not be too impactful considering the general motion toward manifest table formats like Iceberg and Hudi instead of the traditional Hive table layout. These don't drive heavy rename traffic in the same way.
* The group agreed on targeting Apache Hadoop 3.5.0, aligned with Java 17 support. This means users who need Java 8 or 11 support wouldn't be able to use it, but they still have access to the stable releases from the existing repo to help with that.
* The group was generally interested in keeping a short-lived feature branch and including it for 3.5.0 ASAP. We all want Java 17! Detailed information on testing would really help, especially anything that goes beyond the typical {{mvn verify}} unit + integration testing setup.
* We discussed dependency management as a potential pain point. The existing GCSFS repo has taken the strategy of heavily shading dependencies, especially protobuf, grpc and guava. Apache Hadoop has taken the approach of a shared bundle of shaded dependencies via the hadoop-thirdparty repo. We left this as an open question to be settled later.
* There was some discussion of the new GCS [client-side credential access boundary|https://cloud.google.com/iam/docs/downscoping-short-lived-credentials#client-side-token-exchange] feature and how it relates to [PR#587|https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/587/files]. This PR provides hooks for the file system's token provider to express access boundaries. You can use these hooks to implement a plugin integration with STS, and that integration could use client-side exchange. However, the existing repo does not contain a full end-to-end STS integration, and we don't have that in current scope for transfer to ASF.
* Sidebar: I got voluntold to be the 3.5.0 release manager. :-D I was planning on volunteering anyway, so I'm happy to help.

Next steps:

* Chris will cut a feature branch for HADOOP-19343.
* Arun will share an updated doc with more details on the development plan, including the intended roadmap of which features to add and when.

> Add native support for GCS connector
> ------------------------------------
>
> Key: HADOOP-19343
> URL: https://issues.apache.org/jira/browse/HADOOP-19343
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs
> Affects Versions: 3.5.0
> Reporter: Abhishek Modi
> Assignee: Arunkumar Chacko
> Priority: Major
> Attachments: GCS connector for Hadoop.pdf