Hi all,

Poorvank (cc'ed) and I would like to start a discussion about a potential improvement to Flink's Google Cloud Storage integration: a native GCS filesystem that is independent of Hadoop. We were able to do the same for S3 earlier [1].

The overall aim is to move toward a Hadoop-free Flink filesystem layer and to unlock potential performance benefits targeted at Flink's requirements. The goal of this proposal is to explore whether Flink would benefit from a first-class GCS filesystem implementation built directly on top of the Google Cloud Storage client libraries rather than relying on the Hadoop connector. If the discussion gains positive traction, the next step would be to prepare a formal FLIP.

The Current State

Today, Flink's GCS support is provided through flink-gs-fs-hadoop [2], which is based on Google's Cloud Storage Hadoop connector [3]. This approach has served Flink well, but it also introduces some limitations:

1. Flink's GCS integration depends on the Hadoop filesystem abstraction and the Hadoop-based GCS connector. As a result, upgrades and feature adoption are tied to the evolution of those external components.
2. The dependency stack is larger than necessary for users who only require Google Cloud Storage support. In practice, users must bring in Hadoop-based components even though the underlying storage system is an object store.
3. Leveraging new capabilities from Google Cloud Storage often requires waiting for support to become available through the Hadoop connector before Flink can benefit from them.

Proposed Direction

I would like to explore the feasibility of a new filesystem implementation, tentatively named flink-gs-fs-native, built directly on top of the Google Cloud Storage client libraries. The goals would be:

1. Provide a Hadoop-independent implementation of Flink's FileSystem API for Google Cloud Storage.
2. Reduce dependency complexity and make the GCS integration easier to maintain and evolve.
3. Allow Flink to adopt new Google Cloud Storage features and performance improvements directly, without depending on Hadoop abstractions.
4. Continue supporting Flink features such as checkpointing, savepoints, state backends, and file sinks through a native implementation.

A Possible Migration Path

To ensure a smooth transition, a phased approach could be considered:

Phase 1: Introduce the native GCS filesystem as an optional plugin alongside the existing flink-gs-fs-hadoop connector.
Phase 2: Gather community feedback, validate production readiness, and achieve feature parity with the existing implementation.
Phase 3: If the native implementation proves mature and broadly adopted, discuss whether the Hadoop-based implementation should remain, be deprecated, or continue to coexist.

Questions for the Community

1. What are the biggest pain points users face today with flink-gs-fs-hadoop?
2. Are there any critical capabilities provided by the Hadoop-based GCS connector that would be difficult or undesirable to reimplement?
3. Would a Hadoop-independent GCS filesystem provide meaningful value for your Flink deployments?
4. Are there specific GCS features or operational concerns that should be considered from the beginning?

Looking forward to hearing the community's thoughts.

Best,
Samrat

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-555%3A+Flink+Native+S3+FileSystem
[2] https://github.com/apache/flink/tree/master/flink-filesystems/flink-gs-fs-hadoop
[3] https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/master/gcs
