@Samrat
Thanks for the detailed explanation of the metrics usage.

Throttling is not supported by the current implementation, even though
we plan to add metrics for it. That is fine as is, however, since I'm about
to add throttling support soon.

------------

One small API refinement worth considering: instead of adding a second
"configure(Configuration, MetricGroup)"
overload to FileSystemFactory, introduce a separate opt-in interface:

public interface MetricsAware {
    void setMetricGroup(MetricGroup metricGroup);
}

Then inside FileSystem.initialize():
for (FileSystemFactory factory : factories) {
    if (factory instanceof MetricsAware) {
        ((MetricsAware) factory).setMetricGroup(metricGroup);
    }
}

This keeps FileSystemFactory's contract unchanged: third-party
implementations need zero
modifications unless they want metrics. The FLIP's default-on collection is
fine; this is purely an interface hygiene suggestion.
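
To make the suggestion concrete, here is a minimal, self-contained sketch of
the opt-in pattern. The nested MetricGroup and FileSystemFactory types are
simplified stand-ins so the snippet compiles without Flink on the classpath;
LegacyFactory, S3LikeFactory and injectMetricGroup are hypothetical names used
for illustration only and are not taken from the FLIP or from Flink.

```java
import java.util.Arrays;
import java.util.List;

class MetricsAwareSketch {

    // Stand-in for Flink's MetricGroup (assumption: reduced to a marker type).
    interface MetricGroup {}

    // Stand-in for FileSystemFactory; its contract stays untouched.
    interface FileSystemFactory {
        String getScheme();
    }

    // The opt-in interface proposed above.
    interface MetricsAware {
        void setMetricGroup(MetricGroup metricGroup);
    }

    // Hypothetical third-party factory that ignores metrics: zero changes needed.
    static final class LegacyFactory implements FileSystemFactory {
        @Override
        public String getScheme() {
            return "legacy";
        }
    }

    // Hypothetical factory that opts in by also implementing MetricsAware.
    static final class S3LikeFactory implements FileSystemFactory, MetricsAware {
        private MetricGroup metricGroup;

        @Override
        public String getScheme() {
            return "s3";
        }

        @Override
        public void setMetricGroup(MetricGroup metricGroup) {
            this.metricGroup = metricGroup;
        }

        boolean hasMetricGroup() {
            return metricGroup != null;
        }
    }

    // Same loop as above; returns how many factories received the group.
    static int injectMetricGroup(List<FileSystemFactory> factories, MetricGroup group) {
        int injected = 0;
        for (FileSystemFactory factory : factories) {
            if (factory instanceof MetricsAware) {
                ((MetricsAware) factory).setMetricGroup(group);
                injected++;
            }
        }
        return injected;
    }

    public static void main(String[] args) {
        S3LikeFactory s3 = new S3LikeFactory();
        List<FileSystemFactory> factories = Arrays.asList(new LegacyFactory(), s3);
        int injected = injectMetricGroup(factories, new MetricGroup() {});
        System.out.println("factories that received a metric group: " + injected);
        System.out.println("s3 factory has metric group: " + s3.hasMetricGroup());
    }
}
```

Only the factory that implements MetricsAware is touched; the legacy one is
skipped without any change to its code.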

@Aleksandr
If opt-in means "s3.metrics.enabled" defaults to "false", I'd say that's
not the way to go.
Observability features that require pre-incident configuration tend to
never get enabled,
which directly defeats the FLIP's stated goal of closing the operational
blindness gap.

The concern about cardinality is legitimate, but the math is favorable:
these ~50 series are at
TM scope, not subtask scope. A 100-TM cluster adds roughly 5,000 series,
which is modest
compared to what operator-level metrics already emit.
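
For reference, the arithmetic behind that estimate as a small sketch. The
~50 series and the 100 TMs are the figures used above; the subtask count is a
made-up number, included only to show why TM scope matters.

```java
class SeriesEstimate {

    // Total exported series = series per scope instance * number of instances.
    static long totalSeries(long seriesPerInstance, long instanceCount) {
        return Math.multiplyExact(seriesPerInstance, instanceCount);
    }

    public static void main(String[] args) {
        long seriesPerInstance = 50;       // ~50 series, as stated in the FLIP
        long taskManagers = 100;           // cluster size used in this mail
        long hypotheticalSubtasks = 2_000; // assumption, for contrast only

        System.out.println("TM scope:      " + totalSeries(seriesPerInstance, taskManagers));
        System.out.println("subtask scope: " + totalSeries(seriesPerInstance, hypotheticalSubtasks));
    }
}
```

With these inputs the TM-scope total is 5,000, while the same series at
subtask scope would scale with the (assumed) subtask count instead.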

The right answer is informed default-on with a clear escape hatch. The FLIP
already has
the split between basic (default-on, bounded cardinality) and detailed
(opt-in via "s3.metrics.detailed.enabled").
Teams with strict cardinality budgets can also suppress the entire group at
the reporter level with a single line:
metrics.reporter.<name>.filter.excludes = *.filesystem.*:*:*

During performance testing we intend to measure things in depth, and if
something blows up, fine-tuning is still a possibility during PR review.

G


On Thu, May 21, 2026 at 6:12 PM Aleksandr Iushmanov <[email protected]>
wrote:

> Hi Samrat,
>
> Thank you for putting it together. I believe that this is a good addition
> to ensure that Flink is operation ready.
>
> The proposal overall looks good to me, but I have a concern around the
> number of metrics we enable by default. As you have mentioned in the doc,
> the number of added time series is ~50. I have a feeling that enabling them
> by default may lead to unpleasant surprises in terms of extra cardinality
> and the volume of exported data unless it is guarded through allowlists. My
> personal preference would be to keep this option opt-in.
>
> Please let me know your thoughts on this.
>
> Kind regards,
> Alex
>
>
> On Tue, 5 May 2026 at 10:58, Samrat Deb <[email protected]> wrote:
>
> > Hi All,
> >
> > I'd like to open a discussion on FLIP-576: Filesystem-Plugin Observability
> > for (flink-s3-fs-native)[1].
> >
> > Apache Flink’s filesystem layer is critical to core operations like
> > checkpoints, savepoints, and state access. Most of which rely heavily on
> > S3. Despite this, the current observability in s3<>flink is offering little
> > insight into underlying issues. Engineers lack visibility into key failure
> > signals, including S3 throttling, retry behaviour, slow operations, load
> > distribution, multipart upload leaks, and intermittent stream failures. As
> > a result, diagnosing production issues often requires manual correlation
> > across logs and external systems, making troubleshooting slow and
> > unreliable. This observability gap significantly impacts the operability of
> > Flink in real-world large-scale deployments.
> > This FLIP proposal addresses the same and builds support for native S3 FS.
> >
> > Looking forward to your feedback.
> >
> > Bests,
> > Samrat
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
> >
>
