Hi Ferenc,

Yes, I have completed the Design Doc for the proposal. I intend to start a
separate thread focusing on the design and implementation details.

Cheers,
Samrat

On Thu, Feb 5, 2026 at 12:00 AM Ferenc Csaky <[email protected]> wrote:

> Hi Samrat,
>
> Thanks for driving this! I think it would be good to start a separate
> [DISCUSS][FLIP-555] thread. We can refer to this thread there as previous
> discussion, but IMO it would be good to give the FLIP its own thread,
> following the FLIP process [1].
>
> I'm happy to review it myself in the next couple of days.
>
> Best,
> Ferenc
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
>
>
> On Tuesday, February 3rd, 2026 at 22:42, Samrat Deb <[email protected]>
> wrote:
>
> >
> >
> > Hi,
> >
> > I ran a benchmark of state checkpointing to S3, comparing the proposed
> > native S3 implementation with flink-s3-fs-presto. The results are
> > promising: the native implementation performs better under the setup used.
> > PTAL at the benchmark document[1] for a detailed analysis, including logs
> > and setup details.
> >
> > As a next step, FLIP-555[2] is out for review. PTAL.
> >
> > Cheers,
> > Samrat
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406620396
> > [2]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-555%3A+Flink+Native+S3+FileSystem
> >
> >
> > On Wed, Nov 12, 2025 at 1:21 AM Samrat Deb [email protected] wrote:
> >
> > > Hi Gabor,
> > >
> > > Apologies for the delayed response.
> > >
> > > > - A migration guide from the old connectors would be excellent. That
> > > > way users can see how much effort it is.
> > >
> > > Yes, that’s one of the key aspects. I’ve tested the patch on S3. The
> > > configuration remains exactly the same. The only change required is to
> > > place the new `flink-s3-fs-native` JAR in the `plugins` directory and
> > > remove the `flink-s3-fs-hadoop` JAR from there.
> > > I haven’t documented a detailed design or migration plan yet. I’m
> waiting
> > > for the first round of benchmark and comparison test results.
> > >
> > > > - One of the key points from an operational perspective is to have a
> > > > way to make IOPS usage configurable. As an oversimplified explanation,
> > > > just to get a taste, this can be kept under control in 2 ways and places:
> > > > 1. In Hadoop s3a set `fs.s3a.limit.total`
> > > > 2. In connector set `s3.multipart.upload.min.file.size` and
> > > > `s3.multipart.upload.min.part.size`
> > > > Do I understand it correctly that this is intended to be covered by
> the
> > > > following configs?
> > >
> > > > | s3.upload.min.part.size | 5242880 | Minimum part size for multipart uploads (5MB) |
> > > > | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent uploads per stream |
> > >
> > > Yes, the POC patch currently includes three configurations[1]:
> > > 1. `s3.upload.min.part.size`
> > > 2. `s3.upload.max.concurrent.uploads`
> > > 3. `s3.read.buffer.size`
> > >
> > > The idea is to start by supporting configurable IOPS through these
> > > parameters.
> > > Do you think these minimal configs are sufficient to begin with?
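> > >
> > > For illustration, the three options could be declared roughly as below;
> > > the class and field names are placeholders, and the read-buffer default
> > > is an assumption of mine, not taken from the patch:
> > >
> > > import org.apache.flink.configuration.ConfigOption;
> > > import org.apache.flink.configuration.ConfigOptions;
> > >
> > > public class NativeS3ConfigOptions { // placeholder class name
> > >
> > >     public static final ConfigOption<Long> UPLOAD_MIN_PART_SIZE =
> > >             ConfigOptions.key("s3.upload.min.part.size")
> > >                     .longType()
> > >                     .defaultValue(5L * 1024 * 1024) // 5 MB, as in the table above
> > >                     .withDescription("Minimum part size for multipart uploads.");
> > >
> > >     public static final ConfigOption<Integer> UPLOAD_MAX_CONCURRENT_UPLOADS =
> > >             ConfigOptions.key("s3.upload.max.concurrent.uploads")
> > >                     .intType()
> > >                     .defaultValue(Runtime.getRuntime().availableProcessors()) // "CPU cores"
> > >                     .withDescription("Maximum concurrent uploads per stream.");
> > >
> > >     public static final ConfigOption<Integer> READ_BUFFER_SIZE =
> > >             ConfigOptions.key("s3.read.buffer.size")
> > >                     .intType()
> > >                     .defaultValue(64 * 1024) // assumed default, for illustration only
> > >                     .withDescription("Read buffer size for S3 input streams.");
> > > }
> > >
> > > From an IOPS perspective, the part size controls how often a stream emits
> > > an upload and the concurrency cap bounds how many uploads are in flight at
> > > once.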
> > >
> > > > > I am now drafting a formal benchmark plan based on these specifics
> and
> > > > > will share it with this thread in the coming days for feedback.
> > > > > Waiting for the details.
> > >
> > > Still waiting for my employer to approve resources for this purpose 😅
> > >
> > > Cheers,
> > > Samrat
> > >
> > > [1]
> > >
> https://github.com/apache/flink/pull/27187/files#diff-f1e31c70c03cb943bc0e62fe456ca8d0b6bb63ae56c062d68f54ce2806b43f45R38
> > >
> > > On Wed, Nov 5, 2025 at 5:34 PM Gabor Somogyi [email protected]
> > > wrote:
> > >
> > > > Hi Samrat,
> > > >
> > > > Thanks for the contribution! I've had a brief look at the code, which
> > > > is promising.
> > > >
> > > > I have a couple of questions/remarks:
> > > > - A migration guide from the old connectors would be excellent. That
> > > > way users can see how much effort it is.
> > > > - One of the key points from an operational perspective is to have a
> > > > way to make IOPS usage configurable. As an oversimplified explanation,
> > > > just to get a taste, this can be kept under control in 2 ways and places:
> > > > 1. In Hadoop s3a set `fs.s3a.limit.total`
> > > > 2. In connector set `s3.multipart.upload.min.file.size` and
> > > > `s3.multipart.upload.min.part.size`
> > > > Do I understand it correctly that this is intended to be covered by
> the
> > > > following configs?
> > > >
> > > > | s3.upload.min.part.size | 5242880 | Minimum part size for multipart uploads (5MB) |
> > > > | s3.upload.max.concurrent.uploads | CPU cores | Maximum concurrent uploads per stream |
> > > >
> > > > > I am now drafting a formal benchmark plan based on these specifics
> and
> > > > > will share it with this thread in the coming days for feedback.
> > > > > Waiting for the details.
> > > >
> > > > BR,
> > > > G
> > > >
> > > > On Wed, Nov 5, 2025 at 7:08 AM Samrat Deb [email protected]
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I have a working POC for the Native S3 filesystem, which is now
> > > > > available
> > > > > as a draft PR [1].
> > > > > The POC is functional and has been validated in a local setup with
> > > > > Minio.
> > > > > It's important to note that it does not yet have complete test
> coverage.
> > > > >
> > > > > The immediate next step is to conduct a comprehensive benchmark to
> > > > > compare
> > > > > its performance against the existing `flink-s3-fs-hadoop` and
> > > > > `flink-s3-fs-presto` implementations.
> > > > >
> > > > > I've had a very meaningful discussion with Piotr Nowojski about
> this
> > > > > offline. I am grateful for his detailed guidance on defining a
> rigorous
> > > > > benchmarking strategy, including specific cluster configurations,
> job
> > > > > workloads, and key metrics for evaluating both checkpoint/recovery
> > > > > performance and pure throughput.
> > > > > I am now drafting a formal benchmark plan based on these specifics
> and
> > > > > will
> > > > > share it with this thread in the coming days for feedback.
> > > > >
> > > > > Cheers,
> > > > > Samrat
> > > > >
> > > > > [1] https://github.com/apache/flink/pull/27187
> > > > >
> > > > > On Wed, Oct 29, 2025 at 9:31 PM Samrat Deb [email protected]
> > > > > wrote:
> > > > >
> > > > > > Thank you Martijn for clarifying. I will proceed with creating a task.
> > > > > >
> > > > > > Thanks Mate for the pointer to MinIO; it is a good fit for testing.
> > > > > >
> > > > > > Cheers,
> > > > > > Samrat
> > > > > >
> > > > > > On Mon, 27 Oct 2025 at 11:55 PM, Mate Czagany [email protected]
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Just to add to the MinIO licensing concerns: I could not see any
> > > > > > > recent change to the license itself. They changed the license from
> > > > > > > Apache 2.0 to AGPL-3.0 in 2021, and the Docker image used by the
> > > > > > > tests (which is from 2022) already carries the AGPL-3.0 license.
> > > > > > > This should not be an issue, as Flink neither distributes MinIO nor
> > > > > > > makes it available over the network; it is only used by the tests.
> > > > > > >
> > > > > > > What has changed recently is that MinIO no longer publishes Docker
> > > > > > > images to the public [1], so it might be worth looking into
> > > > > > > alternative solutions in the future, e.g. Garage [2].
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Mate
> > > > > > >
> > > > > > > [1]
> > > > > > >
> https://github.com/minio/minio/issues/21647#issuecomment-3418675115
> > > > > > > [2] https://garagehq.deuxfleurs.fr/
> > > > > > >
> > > > > > > On Mon, Oct 27, 2025 at 5:48 PM Ferenc Csaky
> > > > > > > <[email protected]
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Really nice to see people chime in on this thread. I agree with
> > > > > > > > Martijn about the development approach. There will be some
> > > > > > > > iterations until we can stabilize this anyway, so we can aim to
> > > > > > > > get a good-enough MVP out, then fix issues and reach feature
> > > > > > > > parity with the existing implementations along the way.
> > > > > > > >
> > > > > > > > I am not a licensing expert, but AFAIK the previous images that
> > > > > > > > were released under the acceptable license can continue to be
> > > > > > > > used. For most integration tests we use an ancient image anyway
> > > > > > > > [1]. There is another place where the latest image gets pulled
> > > > > > > > [2]; I guess it would be good to pin an explicit tag there. But
> > > > > > > > AFAIK they stopped publishing to Docker Hub, so I would not
> > > > > > > > expect us to end up pulling an image with a forbidden license.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Ferenc
> > > > > > > >
> > > > > > > > [1]
> > > >
> > > >
> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-test-utils-parent/flink-test-utils-junit/src/main/java/org/apache/flink/util/DockerImageVersions.java#L39
> > > >
> > > > > > > > [2]
> > > >
> > > >
> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-end-to-end-tests/test-scripts/common_s3_minio.sh#L51
> > > >
> > > > > > > > On Sunday, October 26th, 2025 at 22:05, Martijn Visser <
> > > > > > > > [email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Samrat,
> > > > > > > > >
> > > > > > > > > First of all, thanks for the proposal. It's long overdue
> to get
> > > > > > > > > this
> > > > > > > > > in a
> > > > > > > > > better state.
> > > > > > > > >
> > > > > > > > > With regards to the schemes, I would say to ship an initial
> > > > > > > > > release
> > > > > > > > > that
> > > > > > > > > does not include support for s3a and s3p, and focus first
> on
> > > > > > > > > getting
> > > > > > > > > this
> > > > > > > > > new implementation into a stable state. When that's done,
> as a
> > > > > > > > > follow-up,
> > > > > > > > > we can consider adding support for s3a and s3p on this
> > > > > > > > > implementation,
> > > > > > > > > and
> > > > > > > > > when that's there consider deprecating the older
> > > > > > > > > implementations. It
> > > > > > > > > will
> > > > > > > > > probably take multiple releases before we have this in a
> stable
> > > > > > > > > state.
> > > > > > > > >
> > > > > > > > > Not directly related to this, but given that MinIO decided
> to
> > > > > > > > > change
> > > > > > > > > their
> > > > > > > > > license, do we also need to refactor existing tests to not
> use
> > > > > > > > > MinIO
> > > > > > > > > anymore but something else?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Martijn
> > > > > > > > >
> > > > > > > > > On Sat, Oct 25, 2025 at 1:38 AM Samrat Deb
> [email protected]
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > One clarifying question regarding the URI schemes:
> > > > > > > > > >
> > > > > > > > > > Currently, the Flink ecosystem uses multiple schemes to
> > > > > > > > > > differentiate
> > > > > > > > > > between S3 implementations: s3a:// for the Hadoop-based
> > > > > > > > > > connector
> > > > > > > > > > and
> > > > > > > > > > s3p://[1] for the Presto-based one, which is often
> recommended
> > > > > > > > > > for
> > > > > > > > > > checkpointing.
> > > > > > > > > >
> > > > > > > > > > A key goal of the proposed flink-s3-fs-native is to
> unify these
> > > > > > > > > > into a
> > > > > > > > > > single implementation. With that in mind, what should be
> the
> > > > > > > > > > strategy
> > > > > > > > > > for
> > > > > > > > > > scheme support? Should the new native s3 filesystem
> register
> > > > > > > > > > only
> > > > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > simple s3:// scheme, aiming to deprecate the others? Or
> would
> > > > > > > > > > it
> > > > > > > > > > be
> > > > > > > > > > beneficial to also support s3a:// and s3p:// to provide a
> > > > > > > > > > smoother
> > > > > > > > > > migration path for users who may have these schemes in
> their
> > > > > > > > > > existing
> > > > > > > > > > job
> > > > > > > > > > configurations?
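> > > > > > > > > >
> > > > > > > > > > For reference, supporting an extra scheme is mostly a matter of shipping one
> > > > > > > > > > more factory through the plugin's service file; here is a rough sketch of
> > > > > > > > > > what that could look like (the NativeS3FileSystem class name is hypothetical):
> > > > > > > > > >
> > > > > > > > > > import java.io.IOException;
> > > > > > > > > > import java.net.URI;
> > > > > > > > > > import org.apache.flink.configuration.Configuration;
> > > > > > > > > > import org.apache.flink.core.fs.FileSystem;
> > > > > > > > > > import org.apache.flink.core.fs.FileSystemFactory;
> > > > > > > > > >
> > > > > > > > > > public class NativeS3FileSystemFactory implements FileSystemFactory {
> > > > > > > > > >
> > > > > > > > > >     private Configuration config;
> > > > > > > > > >
> > > > > > > > > >     @Override
> > > > > > > > > >     public String getScheme() {
> > > > > > > > > >         return "s3";
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     @Override
> > > > > > > > > >     public void configure(Configuration config) {
> > > > > > > > > >         this.config = config; // pick up s3.* options here
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > >     @Override
> > > > > > > > > >     public FileSystem create(URI fsUri) throws IOException {
> > > > > > > > > >         return new NativeS3FileSystem(fsUri, config); // hypothetical class
> > > > > > > > > >     }
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > // An s3a:// or s3p:// alias would be another factory in its own file that
> > > > > > > > > > // only overrides getScheme(), listed alongside this one in
> > > > > > > > > > // META-INF/services/org.apache.flink.core.fs.FileSystemFactory.
> > > > > > > > > >
> > > > > > > > > > That would keep the cost of a compatibility alias low if we decide to keep the
> > > > > > > > > > old schemes working during migration.
> > > > > > > > > >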
> > > > > > > > > > Cheers,
> > > > > > > > > > Samrat
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/generalui/s3p
> > > > > > > > > >
> > > > > > > > > > On Wed, Oct 22, 2025 at 6:31 PM Piotr Nowojski
> > > > > > > > > > [email protected]
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Samrat,
> > > > > > > > > > >
> > > > > > > > > > > > 1. Even if the specifics are hazy, could you recall
> the
> > > > > > > > > > > > general
> > > > > > > > > > > > nature of those concerns? For instance, were they
> related
> > > > > > > > > > > > to
> > > > > > > > > > > > S3's
> > > > > > > > > > > > eventual
> > > > > > > > > > > > consistency model, which has since improved, the
> atomicity
> > > > > > > > > > > > of
> > > > > > > > > > > > Multipart
> > > > > > > > > > > > Upload commits, or perhaps complex failure/recovery
> > > > > > > > > > > > scenarios
> > > > > > > > > > > > during
> > > > > > > > > > > > the
> > > > > > > > > > > > commit phase?
> > > > > > > > > > >
> > > > > > > > > > > and
> > > > > > > > > > >
> > > > > > > > > > > > *8. *The flink-s3-fs-presto connector explicitly
> throws an
> > > > > > > > > > > > `UnsupportedOperationException` when
> > > > > > > > > > > > `createRecoverableWriter()` is
> > > > > > > > > > > > called.
> > > > > > > > > > > > Was this a deliberate design choice to keep the
> Presto
> > > > > > > > > > > > connector
> > > > > > > > > > > > lightweight and optimized specifically for
> checkpointing,
> > > > > > > > > > > > or
> > > > > > > > > > > > were
> > > > > > > > > > > > there
> > > > > > > > > > > > other technical challenges that prevented its
> > > > > > > > > > > > implementation
> > > > > > > > > > > > at
> > > > > > > > > > > > the
> > > > > > > > > > > > time?
> > > > > > > > > > > > Any context on this would be very helpful
> > > > > > > > > > >
> > > > > > > > > > > I very vaguely remember that at least one of those
> > > > > > > > > > > concerns was with respect to how long it takes for S3 to
> > > > > > > > > > > make certain operations visible: you think you have
> > > > > > > > > > > uploaded and committed a file, but in reality it might not
> > > > > > > > > > > be visible for tens of seconds.
> > > > > > > > > > >
> > > > > > > > > > > Sorry, I don't remember more (or even if there was
> more). I
> > > > > > > > > > > was
> > > > > > > > > > > only
> > > > > > > > > > > superficially involved
> > > > > > > > > > > in the S3 connector back then - just
> participated/overheard
> > > > > > > > > > > some
> > > > > > > > > > > discussions.
> > > > > > > > > > >
> > > > > > > > > > > > 2. It's clear that implementing an efficient
> > > > > > > > > > > > PathsCopyingFileSystem[2] is a non-negotiable requirement
> > > > > > > > > > > > for performance. Are there any benchmark numbers available
> > > > > > > > > > > > that can be used as a reference to evaluate how the new
> > > > > > > > > > > > implementation deviates?
> > > > > > > > > > >
> > > > > > > > > > > I only have the numbers that I put in the original
> Flip [1].
> > > > > > > > > > > I
> > > > > > > > > > > don't
> > > > > > > > > > > remember the benchmark
> > > > > > > > > > > setup, but it must have been something simple. Like
> just let
> > > > > > > > > > > some
> > > > > > > > > > > job
> > > > > > > > > > > accumulate 1GB of state
> > > > > > > > > > > and measure how long the state downloading phase of
> recovery
> > > > > > > > > > > was
> > > > > > > > > > > taking.
> > > > > > > > > > >
> > > > > > > > > > > > 3. Do you recall the workload characteristics for
> that PoC?
> > > > > > > > > > > > Specifically,
> > > > > > > > > > > > was the 30-40% performance advantage of s5cmd
> observed when
> > > > > > > > > > > > copying
> > > > > > > > > > > > many
> > > > > > > > > > > > small files (like checkpoint state) or larger,
> > > > > > > > > > > > multi-gigabyte
> > > > > > > > > > > > files?
> > > > > > > > > > >
> > > > > > > > > > > It was just a regular mix of compacted RocksDB sst files,
> > > > > > > > > > > with a total state size of 1 GB or at most a couple of GBs.
> > > > > > > > > > > So most of the files were around ~64MB or ~128MB, with a
> > > > > > > > > > > couple of smaller L0 files, and maybe one larger L2 file.
> > > > > > > > > > >
> > > > > > > > > > > > 4. The idea of a switchable implementation sounds
> great.
> > > > > > > > > > > > Would
> > > > > > > > > > > > you
> > > > > > > > > > > > envision this as a configuration flag (e.g.,
> > > > > > > > > > > > s3.native.copy.strategy=s5cmd
> > > > > > > > > > > > or s3.native.copy.strategy=sdk) that selects the
> backend
> > > > > > > > > > > > implementation
> > > > > > > > > > > > at
> > > > > > > > > > > > runtime? And conversely, is it worth adding configuration
> > > > > > > > > > > > that exposes some implementation-level information?
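> > > > > > > > > > > >
> > > > > > > > > > > > For example, something along these lines (the key name and enum are purely
> > > > > > > > > > > > illustrative, not taken from the POC):
> > > > > > > > > > > >
> > > > > > > > > > > > import org.apache.flink.configuration.ConfigOption;
> > > > > > > > > > > > import org.apache.flink.configuration.ConfigOptions;
> > > > > > > > > > > >
> > > > > > > > > > > > public class S3CopyOptions { // illustrative holder class
> > > > > > > > > > > >
> > > > > > > > > > > >     public enum CopyStrategy { SDK, S5CMD }
> > > > > > > > > > > >
> > > > > > > > > > > >     public static final ConfigOption<CopyStrategy> COPY_STRATEGY =
> > > > > > > > > > > >             ConfigOptions.key("s3.native.copy.strategy")
> > > > > > > > > > > >                     .enumType(CopyStrategy.class)
> > > > > > > > > > > >                     .defaultValue(CopyStrategy.SDK)
> > > > > > > > > > > >                     .withDescription("Backend used for bulk copies: the AWS SDK"
> > > > > > > > > > > >                             + " path or an external s5cmd binary.");
> > > > > > > > > > > > }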
> > > > > > > > > > >
> > > > > > > > > > > I think something like that should be fine, assuming
> that
> > > > > > > > > > > `s5cmd`
> > > > > > > > > > > will
> > > > > > > > > > > again
> > > > > > > > > > > prove significantly faster and/or more cpu efficient.
> If
> > > > > > > > > > > not, if
> > > > > > > > > > > the
> > > > > > > > > > > SDKv2
> > > > > > > > > > > has
> > > > > > > > > > > already improved and caught up with the `s5cmd`, then
> it
> > > > > > > > > > > probably
> > > > > > > > > > > doesn't
> > > > > > > > > > > make sense to keep `s5cmd` support.
> > > > > > > > > > >
> > > > > > > > > > > > 5. My understanding is that the key takeaway here is
> to
> > > > > > > > > > > > avoid
> > > > > > > > > > > > the
> > > > > > > > > > > > file-by-file stream-based copy used in the vanilla
> > > > > > > > > > > > connector
> > > > > > > > > > > > and
> > > > > > > > > > > > leverage
> > > > > > > > > > > > bulk operations, which PathsCopyingFileSystem[2]
> enables.
> > > > > > > > > > > > This
> > > > > > > > > > > > seems
> > > > > > > > > > > > most
> > > > > > > > > > > > critical during state download on recovery. Please let
> > > > > > > > > > > > me know if my inference is in the right direction.
> > > > > > > > > > >
> > > > > > > > > > > Yes, but you should also make the bulk transfer
> > > > > > > > > > > configurable: how many bulk transfers can happen in
> > > > > > > > > > > parallel, etc.
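> > > > > > > > > > >
> > > > > > > > > > > Just to illustrate the shape of it (nothing Flink-specific, names made up):
> > > > > > > > > > > a config value bounding the parallelism, plus back-pressure on the submitter:
> > > > > > > > > > >
> > > > > > > > > > > import java.util.List;
> > > > > > > > > > > import java.util.concurrent.ExecutorService;
> > > > > > > > > > > import java.util.concurrent.Executors;
> > > > > > > > > > > import java.util.concurrent.Semaphore;
> > > > > > > > > > >
> > > > > > > > > > > /** Illustration only: bound how many bulk transfers run in parallel. */
> > > > > > > > > > > final class BoundedBulkCopier {
> > > > > > > > > > >
> > > > > > > > > > >     private final ExecutorService pool;
> > > > > > > > > > >     private final Semaphore permits;
> > > > > > > > > > >
> > > > > > > > > > >     BoundedBulkCopier(int maxParallelTransfers) { // value taken from a config option
> > > > > > > > > > >         this.pool = Executors.newFixedThreadPool(maxParallelTransfers);
> > > > > > > > > > >         this.permits = new Semaphore(maxParallelTransfers);
> > > > > > > > > > >     }
> > > > > > > > > > >
> > > > > > > > > > >     void copyAll(List<Runnable> copyTasks) throws InterruptedException {
> > > > > > > > > > >         for (Runnable task : copyTasks) {
> > > > > > > > > > >             permits.acquire(); // back-pressures the caller once the limit is reached
> > > > > > > > > > >             pool.submit(() -> {
> > > > > > > > > > >                 try {
> > > > > > > > > > >                     task.run(); // the actual download/upload of one path
> > > > > > > > > > >                 } finally {
> > > > > > > > > > >                     permits.release();
> > > > > > > > > > >                 }
> > > > > > > > > > >             });
> > > > > > > > > > >         }
> > > > > > > > > > >     }
> > > > > > > > > > > }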
> > > > > > > > > > >
> > > > > > > > > > > > 6. The warning about `s5cmd` causing OOMs sounds like an
> > > > > > > > > > > > indication to consider an `S3TransferManager`[3]-based
> > > > > > > > > > > > implementation, which might offer more granular control
> > > > > > > > > > > > over buffering and in-flight requests. Do you think
> > > > > > > > > > > > exploring `S3TransferManager` further would be valuable?
> > > > > > > > > > >
> > > > > > > > > > > I'm pretty sure if you start hundreds of bulk
> transfers in
> > > > > > > > > > > parallel
> > > > > > > > > > > via
> > > > > > > > > > > the
> > > > > > > > > > > `S3TransferManager` you can get the same problems with
> > > > > > > > > > > running
> > > > > > > > > > > out of
> > > > > > > > > > > memory or exceeding available network throughput. I
> don't
> > > > > > > > > > > know
> > > > > > > > > > > if
> > > > > > > > > > > `S3TransferManager` is better or worse in that regard
> to be
> > > > > > > > > > > honest.
> > > > > > > > > > >
> > > > > > > > > > > > 7. The insight on AWS aggressively dropping packets
> > > > > > > > > > > > instead of gracefully throttling is invaluable. Currently
> > > > > > > > > > > > I have a limited understanding of how AWS behaves when
> > > > > > > > > > > > throttling; I will dig deeper into it and ask for
> > > > > > > > > > > > clarification based on my findings or doubts. To counter
> > > > > > > > > > > > this, were you thinking of a configurable rate limiter
> > > > > > > > > > > > within the filesystem itself (e.g., setting max bandwidth
> > > > > > > > > > > > or max concurrent requests), or something more dynamic
> > > > > > > > > > > > that could adapt to network conditions?
> > > > > > > > > > >
> > > > > > > > > > > Flat rate limiting is tricky because AWS offers burst
> > > > > > > > > > > network capacity, which comes in very handy and in the vast
> > > > > > > > > > > majority of cases works fine. But for some jobs, if you
> > > > > > > > > > > exceed that burst capacity, AWS starts dropping your
> > > > > > > > > > > packets and then the problems happen. On the other hand, if
> > > > > > > > > > > you rate limit to your normal capacity, you are leaving a
> > > > > > > > > > > lot of network throughput unused during recoveries.
> > > > > > > > > > >
> > > > > > > > > > > At the same time AWS doesn't share details of the burst
> > > > > > > > > > > capacity, so it's sometimes tricky to configure the whole
> > > > > > > > > > > system properly. I don't have a universal good answer for
> > > > > > > > > > > that :(
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Piotrek
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 21, 2025 at 9:40 PM Samrat Deb
> > > > > > > > > > > [email protected] wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Gabor/ Ferenc
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for sharing the pointer and valuable
> feedback.
> > > > > > > > > > > >
> > > > > > > > > > > > The link to the custom `XmlResponsesSaxParser`[1]
> looks
> > > > > > > > > > > > scary
> > > > > > > > > > > > 😦
> > > > > > > > > > > > and contains hidden complexity.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Could you share some context on why this custom
> parser
> > > > > > > > > > > > was
> > > > > > > > > > > > necessary?
> > > > > > > > > > > > Was it to work around a specific bug, a performance
> issue,
> > > > > > > > > > > > or
> > > > > > > > > > > > an
> > > > > > > > > > > > inconsistency in the S3 XML API responses that the
> default
> > > > > > > > > > > > AWS
> > > > > > > > > > > > SDK
> > > > > > > > > > > > parser
> > > > > > > > > > > > couldn't handle at the time? With SDK v2, what core
> > > > > > > > > > > > functionality needs to be tested intensively?
> > > > > > > > > > > >
> > > > > > > > > > > > 2. You mentioned it has no Hadoop dependency, which is
> > > > > > > > > > > > great news. For a new native S3 connector, would
> > > > > > > > > > > > integration simply require implementing a new
> > > > > > > > > > > > S3DelegationTokenProvider/Receiver pair using the AWS
> > > > > > > > > > > > SDK, or are there more subtle integration points with the
> > > > > > > > > > > > framework that should be accounted for?
> > > > > > > > > > > >
> > > > > > > > > > > > 3. I remember solving the Serialized Throwable exception
> > > > > > > > > > > > issue [2], which led to a new bug [3]: an initial fix
> > > > > > > > > > > > caused a regression that Gabor later solved, with Ferenc
> > > > > > > > > > > > providing detailed root-cause insights [4] 😅. It's hard
> > > > > > > > > > > > to be fully sure that all scenarios are covered properly.
> > > > > > > > > > > > This is just one example; there can be other unknowns.
> > > > > > > > > > > > What would be the best approach to test for and prevent
> > > > > > > > > > > > such regressions, or unknown unknowns, especially in the
> > > > > > > > > > > > most sensitive parts of the filesystem logic?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Samrat
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > >
> > > >
> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java
> > > >
> > > > > > > > > > > > [2]
> https://issues.apache.org/jira/browse/FLINK-28513
> > > > > > > > > > > > [3] https://github.com/apache/flink/pull/25231
> > > > > > > > > > > > [4]
> > > > > > > > > > > >
> https://github.com/apache/flink/pull/25231#issuecomment-2312059662
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, 21 Oct 2025 at 3:49 PM, Gabor Somogyi <
> > > > > > > > > > > > [email protected]
> > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Samrat,
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 on the direction that we move away from Hadoop.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is a long-standing discussion, to replace the
> > > > > > > > > > > > > mentioned 2 connectors with something better. Both of
> > > > > > > > > > > > > them have their own weaknesses; I've fixed several
> > > > > > > > > > > > > blockers inside them.
> > > > > > > > > > > > >
> > > > > > > > > > > > > There is definitely magic inside them, please see this
> > > > > > > > > > > > > [1] for example, and there is more 🙂
> > > > > > > > > > > > > I think the most sensitive part is the recovery,
> > > > > > > > > > > > > because it is hard to test all cases.
> > > > > > > > > > > > >
> > > > > > > > > > > > > @Ferenc
> > > > > > > > > > > > >
> > > > > > > > > > > > > > One thing that comes to my mind that will need some
> > > > > > > > > > > > > > changes, and whose involvement in this change is not
> > > > > > > > > > > > > > trivial, is the delegation token framework. Currently
> > > > > > > > > > > > > > it is also tied to the Hadoop stuff and has some
> > > > > > > > > > > > > > abstract classes in the base S3 FS module.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The delegation token framework has no dependency on
> > > > > > > > > > > > > Hadoop, so there is no blocker on the road, but I'm
> > > > > > > > > > > > > here to help if any questions come up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > BR,
> > > > > > > > > > > > > G
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > >
> > > >
> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java#L95-L104
> > > >
> > > > > > > > > > > > > On Tue, Oct 14, 2025 at 8:19 PM Samrat Deb
> > > > > > > > > > > > > [email protected]
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Poorvank (cc'ed) and I are writing to start a
> > > > > > > > > > > > > > discussion
> > > > > > > > > > > > > > about
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > potential
> > > > > > > > > > > > > > improvement for Flink, creating a new, native S3
> > > > > > > > > > > > > > filesystem
> > > > > > > > > > > > > > independent
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > Hadoop/Presto.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The goal of this proposal is to address several
> > > > > > > > > > > > > > challenges
> > > > > > > > > > > > > > related
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > Flink's S3 integration, simplifying
> > > > > > > > > > > > > > flink-s3-filesystem.
> > > > > > > > > > > > > > If
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > discussion
> > > > > > > > > > > > > > gains positive traction, the next step would be
> to move
> > > > > > > > > > > > > > forward
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > formalised FLIP.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The Challenges with the Current S3 Connectors
> > > > > > > > > > > > > > Currently, Flink offers two primary S3 filesystems,
> > > > > > > > > > > > > > flink-s3-fs-hadoop[1] and flink-s3-fs-presto[2].
> > > > > > > > > > > > > > While functional, this dual-connector approach has a
> > > > > > > > > > > > > > few issues:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. The flink-s3-fs-hadoop connector adds an
> > > > > > > > > > > > > > additional dependency to manage. Upgrades like AWS
> > > > > > > > > > > > > > SDK v2 depend on Hadoop/Presto supporting them first
> > > > > > > > > > > > > > before they can be leveraged in flink-s3-filesystem,
> > > > > > > > > > > > > > and it is sometimes restrictive to use features
> > > > > > > > > > > > > > directly from the AWS SDK.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. The flink-s3-fs-presto connector was introduced
> > > > > > > > > > > > > > to mitigate the performance issues of the Hadoop
> > > > > > > > > > > > > > connector, especially for checkpointing. However, it
> > > > > > > > > > > > > > lacks a RecoverableWriter implementation. This split
> > > > > > > > > > > > > > is sometimes confusing for Flink users, highlighting
> > > > > > > > > > > > > > the need for a single, unified solution.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Proposed Solution:
> > > > > > > > > > > > > > A Native, Hadoop-Free S3 Filesystem
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I propose we develop a new filesystem, let's
> call it
> > > > > > > > > > > > > > flink-s3-fs-native,
> > > > > > > > > > > > > > built directly on the modern AWS SDK for Java
> v2. This
> > > > > > > > > > > > > > approach
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > free of any Hadoop or Presto dependencies. I
> have done
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > small
> > > > > > > > > > > > > > prototype
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > validate [3]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is motivated by Trino's native S3 filesystem
> > > > > > > > > > > > > > [4]. The Trino project successfully undertook a
> > > > > > > > > > > > > > similar migration, moving from Hadoop-based object
> > > > > > > > > > > > > > storage clients to its own native implementations.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The new Flink S3 filesystem would:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Provide a single, unified connector for all S3
> > > > > > > > > > > > > > interactions,
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > state
> > > > > > > > > > > > > > backends to sinks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. Implement a high-performance S3RecoverableWriter
> > > > > > > > > > > > > > using S3's Multipart Upload feature, ensuring
> > > > > > > > > > > > > > exactly-once sink semantics (a rough sketch of this
> > > > > > > > > > > > > > flow follows this list).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. Offer a clean, self-contained dependency,
> > > > > > > > > > > > > > drastically
> > > > > > > > > > > > > > simplifying
> > > > > > > > > > > > > > setup
> > > > > > > > > > > > > > and eliminating external dependencies.
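> > > > > > > > > > > > > >
> > > > > > > > > > > > > > For context, a minimal sketch of the Multipart Upload flow behind point 2,
> > > > > > > > > > > > > > using plain AWS SDK v2 calls (error handling and the recoverable-state
> > > > > > > > > > > > > > plumbing are omitted; class and method names here are only illustrative):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > import java.util.ArrayList;
> > > > > > > > > > > > > > import java.util.List;
> > > > > > > > > > > > > > import software.amazon.awssdk.core.sync.RequestBody;
> > > > > > > > > > > > > > import software.amazon.awssdk.services.s3.S3Client;
> > > > > > > > > > > > > > import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
> > > > > > > > > > > > > > import software.amazon.awssdk.services.s3.model.CompletedPart;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > final class MultipartUploadSketch {
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >     static void upload(S3Client s3, String bucket, String key, List<byte[]> parts) {
> > > > > > > > > > > > > >         String uploadId =
> > > > > > > > > > > > > >                 s3.createMultipartUpload(b -> b.bucket(bucket).key(key)).uploadId();
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >         List<CompletedPart> completed = new ArrayList<>();
> > > > > > > > > > > > > >         int partNumber = 1;
> > > > > > > > > > > > > >         for (byte[] part : parts) {
> > > > > > > > > > > > > >             final int pn = partNumber++;
> > > > > > > > > > > > > >             String eTag =
> > > > > > > > > > > > > >                     s3.uploadPart(
> > > > > > > > > > > > > >                                     b -> b.bucket(bucket).key(key)
> > > > > > > > > > > > > >                                             .uploadId(uploadId).partNumber(pn),
> > > > > > > > > > > > > >                                     RequestBody.fromBytes(part))
> > > > > > > > > > > > > >                             .eTag();
> > > > > > > > > > > > > >             completed.add(CompletedPart.builder().partNumber(pn).eTag(eTag).build());
> > > > > > > > > > > > > >         }
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >         // The uploadId plus the completed parts is the state a recoverable
> > > > > > > > > > > > > >         // writer would persist on checkpoint; completion happens only on commit.
> > > > > > > > > > > > > >         s3.completeMultipartUpload(b -> b.bucket(bucket).key(key).uploadId(uploadId)
> > > > > > > > > > > > > >                 .multipartUpload(
> > > > > > > > > > > > > >                         CompletedMultipartUpload.builder().parts(completed).build()));
> > > > > > > > > > > > > >     }
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The key property for exactly-once is that parts can be uploaded
> > > > > > > > > > > > > > speculatively and the object only becomes visible on
> > > > > > > > > > > > > > completeMultipartUpload.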
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > A Phased Migration Path
> > > > > > > > > > > > > > To ensure a smooth transition, we could adopt a
> > > > > > > > > > > > > > phased approach, at a very high level:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Phase 1:
> > > > > > > > > > > > > > Introduce the new native S3 filesystem as an
> optional,
> > > > > > > > > > > > > > parallel
> > > > > > > > > > > > > > plugin.
> > > > > > > > > > > > > > This would allow for community testing and
> adoption
> > > > > > > > > > > > > > without
> > > > > > > > > > > > > > breaking
> > > > > > > > > > > > > > existing setups.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Phase 2:
> > > > > > > > > > > > > > Once the native connector achieves feature
> parity and
> > > > > > > > > > > > > > proven
> > > > > > > > > > > > > > stability,
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > will update the documentation to recommend it as
> the
> > > > > > > > > > > > > > default
> > > > > > > > > > > > > > choice
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > S3 use cases.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Phase 3:
> > > > > > > > > > > > > > In a future major release, the legacy
> > > > > > > > > > > > > > flink-s3-fs-hadoop
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > flink-s3-fs-presto connectors could be formally
> > > > > > > > > > > > > > deprecated,
> > > > > > > > > > > > > > with
> > > > > > > > > > > > > > clear
> > > > > > > > > > > > > > migration guides provided for users.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I would love to hear the community's thoughts on
> this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > A few questions to start the discussion:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. What are the biggest pain points with the
> current S3
> > > > > > > > > > > > > > filesystem?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. Are there any critical features from the
> Hadoop S3A
> > > > > > > > > > > > > > client
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > essential to replicate in a native
> implementation?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. Would a simplified, non-dependent S3
> experience be a
> > > > > > > > > > > > > > valuable
> > > > > > > > > > > > > > improvement for Flink use cases?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Samrat
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1]
> > > >
> > > >
> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-hadoop
> > > >
> > > > > > > > > > > > > > [2]
> > > >
> > > >
> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-presto
> > > >
> > > > > > > > > > > > > > [3] https://github.com/Samrat002/flink/pull/4
> > > > > > > > > > > > > > [4]
> > > >
> > > > https://github.com/trinodb/trino/tree/master/lib/trino-filesystem-s3
>
>
