Hi Samrat,

Thanks for the proposal; such a feature would be very helpful!

I have several questions:
1. Is it possible to expose file size metrics? They might help, for example,
to troubleshoot slow recoveries caused by downloading many small files.
2. Is bulkCopyHelper covered by the proposal? I think it would be helpful
to have requests.size() and total bytes received as metrics
3. Ideally, such metrics should also be exposed by other file systems; in
that case I'd suggest having "s3n" as a label rather than as part of the
metric name.
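
To illustrate points 2 and 3, here is a minimal sketch. It is not Flink's
metric API and not what the FLIP specifies: the class FsMetricsSketch, the
metric names, the "filesystem" label key, and the recordBulkCopy hook are
all hypothetical. It only shows that, with the scheme as a label, the same
metric name can be reused by any file system:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class FsMetricsSketch {
    // Hypothetical registry: one counter per (metric name, label set).
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Builds a stable key such as "fs.bulkCopy.requests{filesystem=s3n}".
    // TreeMap sorts the labels so equal label sets always yield the same key.
    static String key(String name, Map<String, String> labels) {
        return name + new TreeMap<String, String>(labels);
    }

    void add(String name, Map<String, String> labels, long delta) {
        counters.computeIfAbsent(key(name, labels), k -> new LongAdder()).add(delta);
    }

    long get(String name, Map<String, String> labels) {
        LongAdder counter = counters.get(key(name, labels));
        return counter == null ? 0L : counter.sum();
    }

    // Hypothetical hook called once per bulk copy batch: records the number
    // of requests in the batch and the total bytes received, labelled by scheme.
    void recordBulkCopy(String scheme, int requestCount, long totalBytes) {
        Map<String, String> labels = Map.of("filesystem", scheme);
        add("fs.bulkCopy.requests", labels, requestCount);
        add("fs.bulkCopy.bytesReceived", labels, totalBytes);
    }

    public static void main(String[] args) {
        FsMetricsSketch metrics = new FsMetricsSketch();
        metrics.recordBulkCopy("s3n", 3, 4096L);
        metrics.recordBulkCopy("s3n", 2, 1024L);
        metrics.recordBulkCopy("gs", 1, 512L);

        Map<String, String> s3n = Map.of("filesystem", "s3n");
        System.out.println(metrics.get("fs.bulkCopy.requests", s3n));      // prints 5
        System.out.println(metrics.get("fs.bulkCopy.bytesReceived", s3n)); // prints 5120
    }
}
```

With the scheme in the metric name instead, every file system would need
its own set of metric names, and dashboards could not aggregate or compare
across them with a single query.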

As for the "Open questions for community discussion" section, I agree with
both points:
- enable the feature by default and
- don't correlate with checkpoints (it might be trickier than
ThreadLocal).

We use something similar to Approach B internally. I don't think it "Adds
overhead to the per-record path" (because we don't have per-record file
operations), but it does indeed lack lower-level signals.
So the recommended approach makes sense to me.

Regards,
Roman


On Tue, May 5, 2026 at 11:58 AM Samrat Deb <[email protected]> wrote:

> Hi All,
>
> I'd like to open a discussion on FLIP-576: Filesystem-Plugin Observability
> for (flink-s3-fs-native)[1].
>
> Apache Flink’s filesystem layer is critical to core operations like
> checkpoints, savepoints, and state access, most of which rely heavily on
> S3. Despite this, the current observability in s3<>flink offers little
> insight into underlying issues. Engineers lack visibility into key failure
> signals, including S3 throttling, retry behaviour, slow operations, load
> distribution, multipart upload leaks, and intermittent stream failures. As
> a result, diagnosing production issues often requires manual correlation
> across logs and external systems, making troubleshooting slow and
> unreliable. This observability gap significantly impacts the operability of
> Flink in real-world large-scale deployments.
> This FLIP proposal addresses the same and builds support for native S3 FS.
>
> Looking forward to your feedback.
>
> Bests,
> Samrat
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
>
