Hi Gabor,

Thanks for the review.

MetricsGroup is a part of the AWS SDK v2. With this feature,
flink-s3-fs-native will be able to expose the desired metrics, and users
will be able to configure different metrics as required.

> I've read the suggested initial metrics. It mentions what kind of problems
> we want to highlight with them
> but what I miss is an explanation how. If you could add some lightweight
> explanation or example it would be awesome.

Sure, I will add the explanation and an example.

> The other question of mine would be IOPS measurement. Could we add that?

Yes, IOPS is an important metric to track. With this feature, new metrics
are easy to add and user-configurable.

Bests,
Samrat

On Thu, May 14, 2026 at 12:37 PM Gabor Somogyi <[email protected]> wrote:

> Hi Samrat,
>
> I think it's a good addition in general. If I understand correctly then the
> overall plan is to use subset of AWS connector metrics as-is
> plus calculate some own and channel all those into Flink's metrics system,
> right?
>
> I've read the suggested initial metrics. It mentions what kind of problems
> we want to highlight with them
> but what I miss is an explanation how. If you could add some lightweight
> explanation or example it would be awesome.
>
> The other question of mine would be IOPS measurement. Could we add that?
> There are S3 implementations
> which are sensitive to certain IOPS consumption.
>
> BR,
> G
>
>
> On Tue, May 5, 2026 at 11:58 AM Samrat Deb <[email protected]> wrote:
>
> > Hi All,
> >
> > I'd like to open a discussion on FLIP-576: Filesystem-Plugin Observability
> > for (flink-s3-fs-native)[1].
> >
> > Apache Flink’s filesystem layer is critical to core operations like
> > checkpoints, savepoints, and state access. Most of which rely heavily on
> > S3. Despite this, the current observability in s3<>flink is offering little
> > insight into underlying issues.
> > Engineers lack visibility into key failure
> > signals, including S3 throttling, retry behaviour, slow operations, load
> > distribution, multipart upload leaks, and intermittent stream failures. As
> > a result, diagnosing production issues often requires manual correlation
> > across logs and external systems, making troubleshooting slow and
> > unreliable. This observability gap significantly impacts the operability of
> > Flink in real-world large-scale deployments.
> > This FLIP proposal addresses the same and builds support for native S3 FS.
> >
> > Looking forward to your feedback.
> >
> > Bests,
> > Samrat
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
