[ 
https://issues.apache.org/jira/browse/HADOOP-19343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937215#comment-17937215
 ] 

Chris Nauroth commented on HADOOP-19343:
----------------------------------------

We had a productive meeting about this on 2025-03-20, attended by me, 
[~arunchacko], [~mthakur], [~ste...@apache.org], with an exciting cameo from 
[~lmccay]! I took notes, and I'd like to summarize here for the whole community:
 * Reiterating the high-level proposal, we want to bring over a port of the 
code from [the existing 
repository|https://github.com/GoogleCloudDataproc/hadoop-connectors/], with 
some simplifications made along the way. For example, we don't plan on 
retaining the existing Maven multi-module structure, because it provides no 
benefit in Apache Hadoop, and the existing precedent is that each of the other 
cloud file systems is a single module.
 * Acceptance criteria include all tests passing, including file system 
contract tests, and execution of TPC benchmarks. We're initially targeting 
performance within 15% of the existing repo, with iterative improvements 
after that.
 * Our end goal is 100% feature parity with the existing repo. To make the 
project more manageable, we want to structure this in milestones with initial 
"must have" features and additional features to be added incrementally.
 * One specific example of a feature we won't target initially is [GCS 
hierarchical namespace|https://cloud.google.com/storage/docs/hns-overview] 
("HNS") support. We'll be able to work with HNS buckets, but the initial port 
won't include optimizations implemented for HNS buckets. There was a side 
discussion about how this might not be too impactful considering the general 
shift toward manifest table formats like Iceberg and Hudi instead of the 
traditional Hive table layout. These formats don't drive heavy rename traffic 
in the same way.
 * The group agreed on targeting Apache Hadoop 3.5.0, aligned with Java 17 
support. This means users who need Java 8 or 11 support wouldn't be able to use 
it, but they can still use the stable releases from the existing repo.
 * The group was generally interested in keeping a short-lived feature branch 
and including it for 3.5.0 ASAP. We all want Java 17! Detailed information on 
testing would really help, especially anything that goes beyond the typical 
{{mvn verify}} unit + integration testing setup.
 * We discussed dependency management as a potential pain point. The existing 
GCSFS repo has taken the strategy of heavily shading dependencies, especially 
protobuf, gRPC, and Guava. Apache Hadoop has taken the approach of a shared 
bundle of shaded dependencies via the hadoop-thirdparty repo. We left this as 
an open question to be settled later.
 * There was some discussion of the new GCS [client-side credential access 
boundary|https://cloud.google.com/iam/docs/downscoping-short-lived-credentials#client-side-token-exchange]
 feature and how it relates to 
[PR#587|https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/587/files].
 This PR provides hooks for the file system's token provider to express access 
boundaries. You can use these hooks to implement a plugin integration with STS, 
and that integration could use client-side exchange. However, the existing repo 
does not contain a full end-to-end STS integration, and we don't have that in 
current scope for transfer to ASF.
 * Sidebar: I got voluntold to be the 3.5.0 release manager. :-D I was planning 
on volunteering anyway, so I'm happy to help.
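
To make the "within 15%" performance target above concrete, here is a minimal sketch of how such a check could be expressed. It assumes the comparison is made on wall-clock runtime of the same benchmark against both connectors, which the notes above do not specify. The class name {{PerfGate}}, the method {{withinTolerance}}, and the sample timings are all hypothetical and are not part of either code base.

```java
public class PerfGate {
    /**
     * Returns true when the candidate runtime is at most `tolerance`
     * (a fraction, e.g. 0.15 for 15%) slower than the baseline runtime.
     * Assumption: "performance" is measured as elapsed seconds for the
     * same benchmark run against both connectors.
     */
    static boolean withinTolerance(double baselineSeconds,
                                   double candidateSeconds,
                                   double tolerance) {
        if (baselineSeconds <= 0 || tolerance < 0) {
            throw new IllegalArgumentException(
                "baseline must be positive and tolerance non-negative");
        }
        return candidateSeconds <= baselineSeconds * (1.0 + tolerance);
    }

    public static void main(String[] args) {
        // Made-up timings: existing repo takes 100s for a benchmark run.
        System.out.println(withinTolerance(100.0, 112.0, 0.15)); // 12% slower
        System.out.println(withinTolerance(100.0, 120.0, 0.15)); // 20% slower
    }
}
```

With these made-up timings, the first run would pass the 15% gate and the second would not. A real acceptance check would also need to account for run-to-run variance, for example by comparing medians over several runs.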

Next steps:
 * Chris will cut a feature branch for HADOOP-19343.
 * Arun will share an updated doc with more details on the development plan, 
including the intended roadmap of which features to add and when.

> Add native support for GCS connector
> ------------------------------------
>
>                 Key: HADOOP-19343
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19343
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 3.5.0
>            Reporter: Abhishek Modi
>            Assignee: Arunkumar Chacko
>            Priority: Major
>         Attachments: GCS connector for Hadoop.pdf
>
>




