Hi Samrat,

Thank you for putting this together. I believe this is a good addition to ensure that Flink is operation ready.

The proposal overall looks good to me, but I have a concern around the number of metrics we enable by default. As you have mentioned in the doc, the number of added time series is ~50. I have a feeling that enabling them by default may lead to unpleasant surprises in terms of extra cardinality and the volume of exported data, unless it is guarded through allowlists. My personal preference would be to keep this option opt-in.

Please let me know your thoughts on this.

Kind regards,
Alex

On Tue, 5 May 2026 at 10:58, Samrat Deb <[email protected]> wrote:

> Hi All,
>
> I'd like to open a discussion on FLIP-576: Filesystem-Plugin Observability
> for (flink-s3-fs-native)[1].
>
> Apache Flink's filesystem layer is critical to core operations like
> checkpoints, savepoints, and state access. Most of which rely heavily on
> S3. Despite this, the current observability in s3<>flink is offering little
> insight into underlying issues. Engineers lack visibility into key failure
> signals, including S3 throttling, retry behaviour, slow operations, load
> distribution, multipart upload leaks, and intermittent stream failures. As
> a result, diagnosing production issues often requires manual correlation
> across logs and external systems, making troubleshooting slow and
> unreliable. This observability gap significantly impacts the operability of
> Flink in real-world large-scale deployments.
> This FLIP proposal addresses the same and builds support for native S3 FS.
>
> Looking forward to your feedback.
>
> Bests,
> Samrat
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
