Hi Samrat,

I think it's a good addition in general. If I understand correctly, the overall plan is to use a subset of the AWS connector metrics as-is, calculate some of our own, and channel all of them into Flink's metrics system, right?
I've read the suggested initial metrics. The FLIP mentions what kinds of problems we want to highlight with them, but what I'm missing is an explanation of how. If you could add a lightweight explanation or example, that would be awesome.

My other question is about IOPS measurement. Could we add that? Some S3 implementations are sensitive to IOPS consumption.

BR,
G

On Tue, May 5, 2026 at 11:58 AM Samrat Deb <[email protected]> wrote:

> Hi All,
>
> I'd like to open a discussion on FLIP-576: Filesystem-Plugin Observability
> for (flink-s3-fs-native)[1].
>
> Apache Flink’s filesystem layer is critical to core operations like
> checkpoints, savepoints, and state access. Most of which rely heavily on
> S3. Despite this, the current observability in s3<>flink is offering little
> insight into underlying issues. Engineers lack visibility into key failure
> signals, including S3 throttling, retry behaviour, slow operations, load
> distribution, multipart upload leaks, and intermittent stream failures. As
> a result, diagnosing production issues often requires manual correlation
> across logs and external systems, making troubleshooting slow and
> unreliable. This observability gap significantly impacts the operability of
> Flink in real-world large-scale deployments.
> This FLIP proposal addresses the same and builds support for native S3 FS.
>
> Looking forward to your feedback.
>
> Bests,
> Samrat
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
>
