Hi all,

Poorvank (cc'ed) and I would like to start a discussion about a potential
improvement to Flink's Google Cloud Storage integration: a native GCS
filesystem that is independent of Hadoop. We were able to do the same for
S3 earlier [1].

The overall effort is to move toward a Hadoop-free Flink filesystem and
unlock potential performance benefits for Flink's core requirements.

The goal of this proposal is to explore whether Flink would benefit from a
first-class GCS filesystem implementation built directly on top of Google
Cloud Storage client libraries rather than relying on the Hadoop connector.
If the discussion gains positive traction, the next step would be to prepare
a formal FLIP.

The Current State
Today, Flink's GCS support is provided through flink-gs-fs-hadoop [2],
which is based on Google's Cloud Storage Hadoop connector [3].

This approach has served Flink well, but it also introduces some
limitations:

   1. Flink's GCS integration depends on the Hadoop filesystem abstraction
      and the Hadoop-based GCS connector. As a result, upgrades and feature
      adoption are tied to the evolution of those external components.

   2. The dependency stack is larger than necessary for users who only
      require Google Cloud Storage support. In practice, users must bring
      in Hadoop-based components even though the underlying storage system
      is an object store.

   3. Leveraging new capabilities from Google Cloud Storage often requires
      waiting for support to become available through the Hadoop connector
      before Flink can benefit from them.

Proposed Direction

We would like to explore the feasibility of a new filesystem
implementation, tentatively named flink-gs-fs-native, built directly on top
of Google Cloud Storage client libraries.

The goals would be:

   1. Provide a Hadoop-independent implementation of Flink's FileSystem API
      for Google Cloud Storage.

   2. Reduce dependency complexity and make the GCS integration easier to
      maintain and evolve.

   3. Allow Flink to adopt new Google Cloud Storage features and performance
      improvements directly, without depending on Hadoop abstractions.

   4. Continue supporting Flink features such as checkpointing, savepoints,
      state backends, and file sinks through a native implementation.

A Possible Migration Path

To ensure a smooth transition, a phased approach could be considered:

Phase 1:
Introduce the native GCS filesystem as an optional plugin alongside the
existing flink-gs-fs-hadoop connector.

Phase 2:
Gather community feedback, validate production readiness, and achieve
feature parity with the existing implementation.

Phase 3:
If the native implementation proves mature and broadly adopted, discuss
whether the Hadoop-based implementation should remain, be deprecated, or
continue to coexist.

Questions for the Community

   1. What are the biggest pain points users face today with
      flink-gs-fs-hadoop?

   2. Are there any critical capabilities provided by the Hadoop-based GCS
      connector that would be difficult or undesirable to reimplement?

   3. Would a Hadoop-independent GCS filesystem provide meaningful value for
      your Flink deployments?

   4. Are there specific GCS features or operational concerns that should be
      considered from the beginning?

Looking forward to hearing the community's thoughts.

Best,
Samrat

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-555%3A+Flink+Native+S3+FileSystem

[2] https://github.com/apache/flink/tree/master/flink-filesystems/flink-gs-fs-hadoop

[3] https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/master/gcs
