Hi Samrat,

Thanks for the proposal; such a feature would be very helpful!
I have several questions:

1. Is it possible to expose file size metrics? They might help troubleshoot slow recoveries caused by, for example, downloading many small files.
2. Is bulkCopyHelper covered by the proposal? I think it would be helpful to have requests.size() and the total bytes received as metrics.
3. Ideally, other file systems should expose such metrics too; in that case, I'd suggest making "s3n" a label rather than part of the metric name.

As for the "Open questions for community discussion" section, I agree with both points:
- enable the feature by default, and
- don't correlate with checkpoints (it might be trickier than a ThreadLocal).

We use something similar to Approach B internally. I don't think it "Adds overhead to the per-record path" (because we don't have per-record file operations), but it does lack lower-level signals. So the recommended approach makes sense to me.

Regards,
Roman

On Tue, May 5, 2026 at 11:58 AM Samrat Deb wrote:

> Hi All,
>
> I'd like to open a discussion on FLIP-576: Filesystem-Plugin Observability
> for flink-s3-fs-native [1].
>
> Apache Flink’s filesystem layer is critical to core operations such as
> checkpoints, savepoints, and state access, most of which rely heavily on
> S3. Despite this, the current S3<>Flink observability offers little
> insight into underlying issues. Engineers lack visibility into key failure
> signals, including S3 throttling, retry behaviour, slow operations, load
> distribution, multipart upload leaks, and intermittent stream failures. As
> a result, diagnosing production issues often requires manual correlation
> across logs and external systems, making troubleshooting slow and
> unreliable. This observability gap significantly impacts the operability of
> Flink in real-world large-scale deployments.
> This FLIP addresses that gap and adds such support to the native S3 FS.
>
> Looking forward to your feedback.
>
> Bests,
> Samrat
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
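To make questions 1 and 3 of the reply more concrete, here is a minimal, hypothetical sketch in plain Java (not Flink's metrics API and not part of FLIP-576): a file-size histogram keyed by a scheme label such as "s3n", instead of one metric name per scheme. The class name, bucket boundaries, and smallFileRatio() helper are assumptions for illustration only. In Flink itself the label would presumably come from a key/value metric group (MetricGroup#addGroup(String key, String value)), which reporters such as Prometheus expose as labels.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch: file-size histogram for downloaded files, keyed by a
 * "scheme" label (e.g. "s3n") rather than embedding the scheme in the metric name.
 * Bucket boundaries are illustrative assumptions, not values from FLIP-576.
 */
final class FileSizeHistogram {
    // Inclusive upper bounds of each bucket in bytes; one extra open-ended bucket follows.
    private static final long[] BOUNDS = {
        64L * 1024,          // <= 64 KiB: "small" files
        1024L * 1024,        // <= 1 MiB
        16L * 1024 * 1024,   // <= 16 MiB
        256L * 1024 * 1024   // <= 256 MiB
    };

    // One array of counters per scheme label, e.g. "s3n", "s3a", "gs".
    private final Map<String, LongAdder[]> bucketsByScheme = new ConcurrentHashMap<>();

    /** Records one downloaded file of the given size under the given scheme label. */
    void record(String scheme, long sizeBytes) {
        LongAdder[] buckets = bucketsByScheme.computeIfAbsent(scheme, s -> newBuckets());
        buckets[bucketIndex(sizeBytes)].increment();
    }

    /** Returns the count in bucket {@code index} for {@code scheme}, or 0 if the scheme is unseen. */
    long count(String scheme, int index) {
        LongAdder[] buckets = bucketsByScheme.get(scheme);
        return buckets == null ? 0L : buckets[index].sum();
    }

    /** Share of recorded files in the smallest bucket; a high value hints at many small files. */
    double smallFileRatio(String scheme) {
        LongAdder[] buckets = bucketsByScheme.get(scheme);
        if (buckets == null) {
            return 0.0;
        }
        long total = 0;
        for (LongAdder bucket : buckets) {
            total += bucket.sum();
        }
        return total == 0 ? 0.0 : (double) buckets[0].sum() / total;
    }

    /** Maps a size to its bucket; sizes above the last bound go to the overflow bucket. */
    static int bucketIndex(long sizeBytes) {
        for (int i = 0; i < BOUNDS.length; i++) {
            if (sizeBytes <= BOUNDS[i]) {
                return i;
            }
        }
        return BOUNDS.length;
    }

    private static LongAdder[] newBuckets() {
        LongAdder[] buckets = new LongAdder[BOUNDS.length + 1];
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] = new LongAdder();
        }
        return buckets;
    }
}
```

During a slow recovery, a high smallFileRatio("s3n") would point at the many-small-files case from question 1, and because the scheme is a label, the same metric name could serve other file systems as well.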

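Question 2 could be sketched in the same spirit: a hypothetical wrapper that records the batch size (the requests.size() mentioned above) and the cumulative bytes received. BulkCopyMetrics, copyAll(), and the copyOne callback are illustrative stand-ins; bulkCopyHelper's actual interface is not shown in this thread, so this is only an assumption about where such counters could be hooked in.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.ToLongFunction;

/**
 * Hypothetical sketch of bulk-copy metrics: requests per batch and total bytes received.
 * The per-request copy function stands in for the real helper; its shape is an assumption.
 */
final class BulkCopyMetrics {
    private final LongAdder bulkCopyCount = new LongAdder();   // number of bulk copy calls
    private final LongAdder requestCount = new LongAdder();    // cumulative requests over all calls
    private final LongAdder bytesReceived = new LongAdder();   // cumulative bytes downloaded
    private final AtomicLong lastBatchSize = new AtomicLong(); // gauge: size of the latest batch

    /** Copies every request in {@code batch} via {@code copyOne} and returns the bytes received. */
    <R> long copyAll(List<R> batch, ToLongFunction<R> copyOne) {
        lastBatchSize.set(batch.size());
        requestCount.add(batch.size());
        bulkCopyCount.increment();
        long total = 0;
        for (R request : batch) {
            long bytes = copyOne.applyAsLong(request); // bytes received for this single request
            bytesReceived.add(bytes);                  // updated per request, not once per batch
            total += bytes;
        }
        return total;
    }

    long bulkCopyCount() { return bulkCopyCount.sum(); }
    long requestCount() { return requestCount.sum(); }
    long bytesReceived() { return bytesReceived.sum(); }
    long lastBatchSize() { return lastBatchSize.get(); }

    /** Average bytes per request; a low value during recovery also suggests many small files. */
    double avgBytesPerRequest() {
        long n = requestCount.sum();
        return n == 0 ? 0.0 : (double) bytesReceived.sum() / n;
    }
}
```

Adding bytes after each request rather than once per batch means a stalled bulk copy still shows partial progress in the bytes-received counter.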