Hi all, I have a working POC for the Native S3 filesystem, which is now available as a draft PR [1]. The POC is functional and has been validated in a local setup against MinIO, though it does not yet have complete test coverage.
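For anyone who wants to repeat that local validation, the rough shape of what I ran is sketched below. Please treat it as an illustration only: the configuration keys mirror the ones used by the existing S3 filesystems (s3.endpoint, s3.path.style.access, s3.access-key, s3.secret-key) and it assumes the native filesystem plugin is on the plugin path and registered for the plain s3:// scheme; the draft PR is the source of truth for the final option names.

    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.core.fs.FSDataOutputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.FileSystem.WriteMode;
    import org.apache.flink.core.fs.Path;

    public class NativeS3MinioSmokeTest {
        public static void main(String[] args) throws Exception {
            // Point the filesystem at a local MinIO instance (e.g. started with
            // `minio server /tmp/data`). The key names are assumed to follow the
            // existing flink-s3-fs-* options; adjust them to whatever the PR exposes.
            Configuration conf = new Configuration();
            conf.setString("s3.endpoint", "http://127.0.0.1:9000");
            conf.setString("s3.path.style.access", "true");
            conf.setString("s3.access-key", "minioadmin");
            conf.setString("s3.secret-key", "minioadmin");
            FileSystem.initialize(conf, null);

            // Assumes the native filesystem is registered for the s3:// scheme.
            FileSystem fs = FileSystem.get(new URI("s3://test-bucket/"));
            Path path = new Path("s3://test-bucket/smoke/hello.txt");
            try (FSDataOutputStream out = fs.create(path, WriteMode.OVERWRITE)) {
                out.write("hello from the native s3 fs".getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("exists: " + fs.exists(path));
        }
    }

This only exercises the basic write path; the recoverable writer and multipart upload paths still need dedicated tests, which is part of the missing coverage mentioned above.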
The immediate next step is to conduct a comprehensive benchmark to compare its performance against the existing `flink-s3-fs-hadoop` and `flink-s3-fs-presto` implementations. I've had a very meaningful discussion with Piotr Nowojski about this offline. I am grateful for his detailed guidance on defining a rigorous benchmarking strategy, including specific cluster configurations, job workloads, and key metrics for evaluating both checkpoint/recovery performance and pure throughput. I am now drafting a formal benchmark plan based on these specifics and will share it with this thread in the coming days for feedback. Cheers, Samrat [1] https://github.com/apache/flink/pull/27187 On Wed, Oct 29, 2025 at 9:31 PM Samrat Deb <[email protected]> wrote: > thank you Martijn for clarifying . > i will proceed with creating a task. > > Thanks Mate for the pointer to Minio for testing. > minio is good to use for testing . > > > Cheers, > Samrat > > > On Mon, 27 Oct 2025 at 11:55 PM, Mate Czagany <[email protected]> wrote: > >> Hi, >> >> Just to add to the MinIO licensing concerns, I could not see any recent >> change to the license itself, they have changed the license from Apache >> 2.0 >> to AGPL-3.0 in 2021, and the Docker image used by the tests (which is from >> 2022) already contains the AGPL-3.0 license. This should not be an issue >> as >> Flink does not distribute nor makes MinIO available over the network, it's >> only used by the tests. >> >> What's changed recently is that MinIO no longer publishes Docker images to >> the public [1], so it might be worth it to look into using alternative >> solutions in the future, e.g. Garage [2]. >> >> Best regards, >> Mate >> >> [1] https://github.com/minio/minio/issues/21647#issuecomment-3418675115 >> [2] https://garagehq.deuxfleurs.fr/ >> >> On Mon, Oct 27, 2025 at 5:48 PM Ferenc Csaky <[email protected]> >> wrote: >> >> > Hi, >> > >> > Really nice to see people chime into this thread. I agree with Martijn >> > about the >> > development approach. There will be some iterations until we can >> stabilize >> > this anyways, >> > so we can try to shoot getting out a good enough MVP, then fix issues + >> > reach feature >> > parity with the existing implementations on the go. >> > >> > I am not a licensing expert but AFAIK the previous images that were >> > released under the >> > acceptable license can be continued to use. For most integration tests, >> we >> > use an >> > ancient image anyways [1]. There is another place where the latest img >> > gets pulled [2], >> > I guess it would be good to apply an explicit that tag there. But AFAIK >> > they stop >> > publishing to Docker Hub, so I would anticipate we cannot end up pulling >> > an image with >> > a forbidden license. >> > >> > Best, >> > Ferenc >> > >> > [1] >> > >> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-test-utils-parent/flink-test-utils-junit/src/main/java/org/apache/flink/util/DockerImageVersions.java#L39 >> > [2] >> > >> https://github.com/apache/flink/blob/fd1a97768b661f19783afe70d93a0a8d3d625b2a/flink-end-to-end-tests/test-scripts/common_s3_minio.sh#L51 >> > >> > >> > >> > >> > On Sunday, October 26th, 2025 at 22:05, Martijn Visser < >> > [email protected]> wrote: >> > >> > > >> > > >> > > Hi Samrat, >> > > >> > > First of all, thanks for the proposal. It's long overdue to get this >> in a >> > > better state. 
>> > > >> > > With regards to the schemes, I would say to ship an initial release >> that >> > > does not include support for s3a and s3p, and focus first on getting >> this >> > > new implementation into a stable state. When that's done, as a >> follow-up, >> > > we can consider adding support for s3a and s3p on this implementation, >> > and >> > > when that's there consider deprecating the older implementations. It >> will >> > > probably take multiple releases before we have this in a stable state. >> > > >> > > Not directly related to this, but given that MinIO decided to change >> > their >> > > license, do we also need to refactor existing tests to not use MinIO >> > > anymore but something else? >> > > >> > > Thanks, >> > > >> > > Martijn >> > > >> > > On Sat, Oct 25, 2025 at 1:38 AM Samrat Deb [email protected] >> wrote: >> > > >> > > > Hi all, >> > > > >> > > > One clarifying question regarding the URI schemes: >> > > > >> > > > Currently, the Flink ecosystem uses multiple schemes to >> differentiate >> > > > between S3 implementations: s3a:// for the Hadoop-based connector >> and >> > > > s3p://[1] for the Presto-based one, which is often recommended for >> > > > checkpointing. >> > > > >> > > > A key goal of the proposed flink-s3-fs-native is to unify these >> into a >> > > > single implementation. With that in mind, what should be the >> strategy >> > for >> > > > scheme support? Should the new native s3 filesystem register only >> for >> > the >> > > > simple s3:// scheme, aiming to deprecate the others? Or would it be >> > > > beneficial to also support s3a:// and s3p:// to provide a smoother >> > > > migration path for users who may have these schemes in their >> existing >> > job >> > > > configurations? >> > > > Cheers, >> > > > Samrat >> > > > >> > > > [1] https://github.com/generalui/s3p >> > > > >> > > > On Wed, Oct 22, 2025 at 6:31 PM Piotr Nowojski [email protected] >> > > > wrote: >> > > > >> > > > > Hi Samrat, >> > > > > >> > > > > > 1. Even if the specifics are hazy, could you recall the general >> > > > > > nature of those concerns? For instance, were they related to >> S3's >> > > > > > eventual >> > > > > > consistency model, which has since improved, the atomicity of >> > Multipart >> > > > > > Upload commits, or perhaps complex failure/recovery scenarios >> > during >> > > > > > the >> > > > > > commit phase? >> > > > > >> > > > > and >> > > > > >> > > > > > *8. *The flink-s3-fs-presto connector explicitly throws an >> > > > > > `UnsupportedOperationException` when >> `createRecoverableWriter()` is >> > > > > > called. >> > > > > > Was this a deliberate design choice to keep the Presto connector >> > > > > > lightweight and optimized specifically for checkpointing, or >> were >> > there >> > > > > > other technical challenges that prevented its implementation at >> the >> > > > > > time? >> > > > > > Any context on this would be very helpful >> > > > > >> > > > > I very vaguely remember that at least one of those concerns was >> with >> > > > > respect to how long >> > > > > does it take for the S3 to make some certain operations visible. >> > That you >> > > > > think you have >> > > > > uploaded and committed a file, but in reality it might not be >> > visible for >> > > > > tens of seconds. >> > > > > >> > > > > Sorry, I don't remember more (or even if there was more). I was >> only >> > > > > superficially involved >> > > > > in the S3 connector back then - just participated/overheard some >> > > > > discussions. >> > > > > >> > > > > > 2. 
It's clear that implementing an efficient >> > > > > > PathsCopyingFileSystem[2] >> > > > > > is >> > > > > > a non-negotiable requirement for performance. Is there any >> > benchmark >> > > > > > numbers available that can be used as reference and evaluate new >> > > > > > implementation deviation ? >> > > > > >> > > > > I only have the numbers that I put in the original Flip [1]. I >> don't >> > > > > remember the benchmark >> > > > > setup, but it must have been something simple. Like just let some >> job >> > > > > accumulate 1GB of state >> > > > > and measure how long the state downloading phase of recovery was >> > taking. >> > > > > >> > > > > > 3. Do you recall the workload characteristics for that PoC? >> > > > > > Specifically, >> > > > > > was the 30-40% performance advantage of s5cmd observed when >> copying >> > > > > > many >> > > > > > small files (like checkpoint state) or larger, multi-gigabyte >> > files? >> > > > > >> > > > > It was just a regular mix of compacted RocksDB sst files, with >> total >> > > > > state >> > > > > size 1 or at most >> > > > > a couple of GBs. So most of the files were around ~64MB or ~128MB, >> > with a >> > > > > couple of >> > > > > smaller L0 files, and maybe one larger L2 file. >> > > > > >> > > > > > 4. The idea of a switchable implementation sounds great. Would >> you >> > > > > > envision this as a configuration flag (e.g., >> > > > > > s3.native.copy.strategy=s5cmd >> > > > > > or s3.native.copy.strategy=sdk) that selects the backend >> > implementation >> > > > > > at >> > > > > > runtime? Also on contrary is it worth adding configuration that >> > exposes >> > > > > > some level of implementation level information ? >> > > > > >> > > > > I think something like that should be fine, assuming that `s5cmd` >> > will >> > > > > again >> > > > > prove significantly faster and/or more cpu efficient. If not, if >> the >> > > > > SDKv2 >> > > > > has >> > > > > already improved and caught up with the `s5cmd`, then it probably >> > doesn't >> > > > > make sense to keep `s5cmd` support. >> > > > > >> > > > > > 5. My understanding is that the key takeaway here is to avoid >> the >> > > > > > file-by-file stream-based copy used in the vanilla connector and >> > > > > > leverage >> > > > > > bulk operations, which PathsCopyingFileSystem[2] enables. This >> > seems >> > > > > > most >> > > > > > critical during state download on recovery. please suggest if my >> > > > > > inference >> > > > > > is in right direction >> > > > > >> > > > > Yes, but you should also make the bult transfer configurable. How >> > many >> > > > > bulk >> > > > > transfers >> > > > > can be happening in parallel etc. >> > > > > >> > > > > > 6. The warning about `s5cmd` causing OOMs sounds like >> indication to >> > > > > > consider `S3TransferManager`[3] implementation, which might >> offer >> > more >> > > > > > granular control over buffering and in-flight requests. Do you >> > think >> > > > > > exploring more on `S3TransferManager` would be valuable ? >> > > > > >> > > > > I'm pretty sure if you start hundreds of bulk transfers in >> parallel >> > via >> > > > > the >> > > > > `S3TransferManager` you can get the same problems with running >> out of >> > > > > memory or exceeding available network throughput. I don't know if >> > > > > `S3TransferManager` is better or worse in that regard to be >> honest. >> > > > > >> > > > > > 7. The insight on AWS aggressively dropping packets instead of >> > > > > > gracefully >> > > > > > throttling is invaluable. 
Currently i have limited understanding >> > on how >> > > > > > aws >> > > > > > behaves at throttling I will deep dive more into it and >> > > > > > look for clarification based on findings or doubt. To counter >> this, >> > > > > > were >> > > > > > you thinking of a configurable rate limiter within the >> filesystem >> > > > > > itself >> > > > > > (e.g., setting max bandwidth or max concurrent requests), or >> > something >> > > > > > more >> > > > > > dynamic that could adapt to network conditions? >> > > > > >> > > > > Flat rate limiting is tricky because AWS offers burst network >> > capacity, >> > > > > which >> > > > > comes very handy, and in the vast majority of cases works fine. >> But >> > for >> > > > > some jobs >> > > > > if you exceed that burst capacity, AWS starts dropping your >> packets >> > and >> > > > > then the >> > > > > problems happen. On the other hand, if rate limit to your normal >> > > > > capacity, >> > > > > you >> > > > > are leaving a lot of network throughput unused during recoveries. >> > > > > >> > > > > At the same time AWS doesn't share details for the burst >> capacity, so >> > > > > it's >> > > > > sometimes >> > > > > tricky to configure the whole system properly. I don't have an >> > universal >> > > > > good answer >> > > > > for that :( >> > > > > >> > > > > Best, >> > > > > Piotrek >> > > > > >> > > > > wt., 21 paź 2025 o 21:40 Samrat Deb [email protected] >> > napisał(a): >> > > > > >> > > > > > Hi Gabor/ Ferenc >> > > > > > >> > > > > > Thank you for sharing the pointer and valuable feedback. >> > > > > > >> > > > > > The link to the custom `XmlResponsesSaxParser`[1] looks scary 😦 >> > > > > > and contains hidden complexity. >> > > > > > >> > > > > > 1. Could you share some context on why this custom parser was >> > > > > > necessary? >> > > > > > Was it to work around a specific bug, a performance issue, or an >> > > > > > inconsistency in the S3 XML API responses that the default AWS >> SDK >> > > > > > parser >> > > > > > couldn't handle at the time? With sdk v2 what are core >> > functionality >> > > > > > that >> > > > > > is required to be intensively tested ? >> > > > > > >> > > > > > 2. You mentioned it has no Hadoop dependency, which is great >> news. >> > > > > > For >> > > > > > a >> > > > > > new native S3 connector, would integration simply require >> > implementing >> > > > > > a >> > > > > > new S3DelegationTokenProvider/Receiver pair using the AWS SDK, >> or >> > are >> > > > > > there >> > > > > > more subtle integration points with the framework that should be >> > > > > > accounted? >> > > > > > >> > > > > > 3. I remember solving Serialized Throwable exception issue [2] >> > > > > > leading >> > > > > > to >> > > > > > a new bug [3], where an initial fix led to a regression that >> Gabor >> > > > > > later >> > > > > > solved with Ferenc providing a detailed root cause insights [4] >> 😅. >> > > > > > Its hard to fully sure that all scenarios are covered properly. >> > This is >> > > > > > one >> > > > > > of the example, there can be other unknowns. >> > > > > > what would be the best approach to test for and prevent such >> > > > > > regressions >> > > > > > or >> > > > > > unknown unknowns, especially in the most sensitive parts of the >> > > > > > filesystem >> > > > > > logic? 
>> > > > > > >> > > > > > Cheers, >> > > > > > Samrat >> > > > > > >> > > > > > [1] >> > > > >> > > > >> > >> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java >> > > > >> > > > > > [2] https://issues.apache.org/jira/browse/FLINK-28513 >> > > > > > [3] https://github.com/apache/flink/pull/25231 >> > > > > > [4] >> > https://github.com/apache/flink/pull/25231#issuecomment-2312059662 >> > > > > > >> > > > > > On Tue, 21 Oct 2025 at 3:49 PM, Gabor Somogyi < >> > > > > > [email protected] >> > > > > > >> > > > > > wrote: >> > > > > > >> > > > > > > Hi Samrat, >> > > > > > > >> > > > > > > +1 on the direction that we move away from hadoop. >> > > > > > > >> > > > > > > This is a long standing discussion to replace the mentioned 2 >> > > > > > > connectors >> > > > > > > with something better. >> > > > > > > Both of them has it's own weaknesses, I've fixed several >> blockers >> > > > > > > inside >> > > > > > > them. >> > > > > > > >> > > > > > > There are definitely magic inside them, please see this [1] >> for >> > > > > > > example >> > > > > > > and >> > > > > > > there are more🙂 >> > > > > > > I think the most sensitive part is the recovery because hard >> to >> > test >> > > > > > > all >> > > > > > > cases. >> > > > > > > >> > > > > > > @Ferenc >> > > > > > > >> > > > > > > > One thing that comes to my mind that will need some changes >> > and its >> > > > > > > > involvement >> > > > > > > > to this change is not trivial is the delegation token >> > framework. >> > > > > > > > Currently >> > > > > > > > it >> > > > > > > > is also tied to the Hadoop stuff and has some abstract >> classes >> > in the >> > > > > > > > base >> > > > > > > > S3 FS >> > > > > > > > module. >> > > > > > > >> > > > > > > The delegation token framework has no dependency on hadoop so >> > there >> > > > > > > is >> > > > > > > no >> > > > > > > blocker on the road, >> > > > > > > but I'm here to help if any question appears. >> > > > > > > >> > > > > > > BR, >> > > > > > > G >> > > > > > > >> > > > > > > [1] >> > > > >> > > > >> > >> https://github.com/apache/flink/blob/0e4e6d7082e83f098d0c1a94351babb3ea407aa8/flink-filesystems/flink-s3-fs-base/src/main/java/com/amazonaws/services/s3/model/transform/XmlResponsesSaxParser.java#L95-L104 >> > > > >> > > > > > > On Tue, Oct 14, 2025 at 8:19 PM Samrat Deb >> [email protected] >> > > > > > > wrote: >> > > > > > > >> > > > > > > > Hi All, >> > > > > > > > >> > > > > > > > Poorvank (cc'ed) and I are writing to start a discussion >> about >> > a >> > > > > > > > potential >> > > > > > > > improvement for Flink, creating a new, native S3 filesystem >> > > > > > > > independent >> > > > > > > > of >> > > > > > > > Hadoop/Presto. >> > > > > > > > >> > > > > > > > The goal of this proposal is to address several challenges >> > related >> > > > > > > > to >> > > > > > > > Flink's S3 integration, simplifying flink-s3-filesystem. If >> > this >> > > > > > > > discussion >> > > > > > > > gains positive traction, the next step would be to move >> forward >> > > > > > > > with >> > > > > > > > a >> > > > > > > > formalised FLIP. >> > > > > > > > >> > > > > > > > The Challenges with the Current S3 Connectors >> > > > > > > > Currently, Flink offers two primary S3 filesystems, >> > > > > > > > flink-s3-fs-hadoop[1] >> > > > > > > > and flink-s3-fs-presto[2]. 
While functional, this >> > dual-connector >> > > > > > > > approach >> > > > > > > > has few issues: >> > > > > > > > >> > > > > > > > 1. The flink-s3-fs-hadoop connector adds an additional >> > dependency >> > > > > > > > to >> > > > > > > > manage. Upgrades like AWS SDK v2 are more dependent on >> > > > > > > > Hadoop/Presto >> > > > > > > > to >> > > > > > > > support first and leverage in flink-s3-filesystem. Sometimes >> > it's >> > > > > > > > restrictive to leverage features directly from the AWS SDK. >> > > > > > > > >> > > > > > > > 2. The flink-s3-fs-presto connector was introduced to >> mitigate >> > the >> > > > > > > > performance issues of the Hadoop connector, especially for >> > > > > > > > checkpointing. >> > > > > > > > However, it lacks a RecoverableWriter implementation. >> > > > > > > > Sometimes it's confusing for Flink users, highlighting the >> need >> > > > > > > > for a >> > > > > > > > single, unified solution. >> > > > > > > > >> > > > > > > > Proposed Solution: >> > > > > > > > A Native, Hadoop-Free S3 Filesystem >> > > > > > > > >> > > > > > > > I propose we develop a new filesystem, let's call it >> > > > > > > > flink-s3-fs-native, >> > > > > > > > built directly on the modern AWS SDK for Java v2. This >> approach >> > > > > > > > would >> > > > > > > > be >> > > > > > > > free of any Hadoop or Presto dependencies. I have done a >> small >> > > > > > > > prototype >> > > > > > > > to >> > > > > > > > validate [3] >> > > > > > > > >> > > > > > > > This is motivated by trino<>s3 [4]. The Trino project >> > successfully >> > > > > > > > undertook a similar migration, moving from Hadoop-based >> object >> > > > > > > > storage >> > > > > > > > clients to their own native implementations. >> > > > > > > > >> > > > > > > > The new Flink S3 filesystem would: >> > > > > > > > >> > > > > > > > 1. Provide a single, unified connector for all S3 >> interactions, >> > > > > > > > from >> > > > > > > > state >> > > > > > > > backends to sinks. >> > > > > > > > >> > > > > > > > 2. Implement a high-performance S3RecoverableWriter using >> S3's >> > > > > > > > Multipart >> > > > > > > > Upload feature, ensuring exactly-once sink semantics. >> > > > > > > > >> > > > > > > > 3. Offer a clean, self-contained dependency, drastically >> > > > > > > > simplifying >> > > > > > > > setup >> > > > > > > > and eliminating external dependencies. >> > > > > > > > >> > > > > > > > A Phased Migration Path >> > > > > > > > To ensure a smooth transition, we could adopt a phased >> > approach on >> > > > > > > > a >> > > > > > > > very >> > > > > > > > high level : >> > > > > > > > >> > > > > > > > Phase 1: >> > > > > > > > Introduce the new native S3 filesystem as an optional, >> parallel >> > > > > > > > plugin. >> > > > > > > > This would allow for community testing and adoption without >> > > > > > > > breaking >> > > > > > > > existing setups. >> > > > > > > > >> > > > > > > > Phase 2: >> > > > > > > > Once the native connector achieves feature parity and proven >> > > > > > > > stability, >> > > > > > > > we >> > > > > > > > will update the documentation to recommend it as the default >> > choice >> > > > > > > > for >> > > > > > > > all >> > > > > > > > S3 use cases. >> > > > > > > > >> > > > > > > > Phase 3: >> > > > > > > > In a future major release, the legacy flink-s3-fs-hadoop and >> > > > > > > > flink-s3-fs-presto connectors could be formally deprecated, >> > with >> > > > > > > > clear >> > > > > > > > migration guides provided for users. 
>> > > > > > > > >> > > > > > > > I would love to hear the community's thoughts on this. >> > > > > > > > >> > > > > > > > A few questions to start the discussion: >> > > > > > > > >> > > > > > > > 1. What are the biggest pain points with the current S3 >> > filesystem? >> > > > > > > > >> > > > > > > > 2. Are there any critical features from the Hadoop S3A >> client >> > that >> > > > > > > > are >> > > > > > > > essential to replicate in a native implementation? >> > > > > > > > >> > > > > > > > 3. Would a simplified, non-dependent S3 experience be a >> > valuable >> > > > > > > > improvement for Flink use cases? >> > > > > > > > >> > > > > > > > Cheers, >> > > > > > > > Samrat >> > > > > > > > >> > > > > > > > [1] >> > > > >> > > > >> > >> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-hadoop >> > > > >> > > > > > > > [2] >> > > > >> > > > >> > >> https://github.com/apache/flink/tree/master/flink-filesystems/flink-s3-fs-presto >> > > > >> > > > > > > > [3] https://github.com/Samrat002/flink/pull/4 >> > > > > > > > [4] >> > > > > > > > >> > https://github.com/trinodb/trino/tree/master/lib/trino-filesystem-s3 >> > >> >
