Re: [DISCUSS] FLIP-576 Filesystem-Plugin Observability (flink-s3-fs-native)

Samrat Deb Wed, 27 May 2026 09:25:59 -0700

Hi Aleksandr Iushmanov,

> The proposal overall looks good to me, but I have a concern around the
> number of metrics we enable by default. As you have mentioned in the doc,
> the number of added time series is ~50. I have a feeling that enabling them
> by default may lead to unpleasant surprises in terms of extra cardinality
> and the volume of exported data unless it is guarded through allowlists. My
> personal preference would be to keep this option opt-in.


Thank you for the suggestion. Making this opt-in makes sense: it lets users
decide the metric cardinality within their own setup.
Here are the changes I plan to add to the FLIP:

  s3.metrics.enabled: true
  s3.metrics.allowlist:
    - api_call_count
    - api_call_duration_ms
    - throttle_count
    - retry_count
    - iops
    - mpu_aborted_total
  s3.metrics.detailed.enabled: false
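
To illustrate how such an allowlist could gate registration, here is a minimal
sketch. It is not the FLIP's implementation: the class S3MetricAllowlist, its
methods, and the metric name bytes_read are hypothetical, and it assumes that a
metric is registered only when s3.metrics.enabled is true and its name appears
in s3.metrics.allowlist.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Flink): decides which S3 metrics get registered. */
final class S3MetricAllowlist {
    private final boolean enabled;      // assumed to mirror s3.metrics.enabled
    private final Set<String> allowed;  // assumed to mirror s3.metrics.allowlist

    S3MetricAllowlist(boolean enabled, List<String> allowlist) {
        this.enabled = enabled;
        this.allowed = new LinkedHashSet<>(allowlist);
    }

    /** A metric is registered only if metrics are enabled and its name is listed. */
    boolean shouldRegister(String metricName) {
        return enabled && allowed.contains(metricName);
    }

    /** Returns the candidate names that pass the allowlist, preserving their order. */
    List<String> filter(List<String> candidates) {
        return candidates.stream()
                .filter(this::shouldRegister)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        S3MetricAllowlist allowlist = new S3MetricAllowlist(
                true,
                Arrays.asList(
                        "api_call_count", "api_call_duration_ms", "throttle_count",
                        "retry_count", "iops", "mpu_aborted_total"));
        // "bytes_read" is a made-up name standing in for a metric outside the list.
        System.out.println(allowlist.filter(
                Arrays.asList("api_call_count", "bytes_read", "iops")));
    }
}
```

With this shape, the number of exported series is bounded by the length of the
allowlist rather than by everything the plugin is able to emit.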


Best,
Samrat



On Fri, May 22, 2026 at 5:26 PM Gabor Somogyi <[email protected]>
wrote:

> @Samrat
> Thanks for the detailed explanation for the metrics usage.
>
> Throttling is not supported by the current implementation, even though
> we plan to add metrics for it. That is fine to go ahead with, however, as
> I'm about to add throttling support soon.
>
> ------------
>
> One small API refinement worth considering: instead of adding a second
> "configure(Configuration, MetricGroup)"
> overload to FileSystemFactory, introduce a separate opt-in interface:
>
> public interface MetricsAware {
>     void setMetricGroup(MetricGroup metricGroup);
> }
>
> Then inside FileSystem.initialize():
> for (FileSystemFactory factory : factories) {
>     if (factory instanceof MetricsAware) {
>         ((MetricsAware) factory).setMetricGroup(metricGroup);
>     }
> }
>
> This keeps FileSystemFactory's contract unchanged: third-party
> implementations need zero modifications unless they want metrics.
> The FLIP's default-on collection is fine; this is purely an interface
> hygiene suggestion.
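
The opt-in interface idea can be tried out in isolation with the sketch below.
It is self-contained and only illustrative: MetricGroup and FileSystemFactory
are reduced stand-ins for the Flink types of the same name, and PlainFactory,
S3Factory, and injectMetricGroup are hypothetical names. Only MetricsAware and
the instanceof loop come from the suggestion above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MetricsAwareSketch {
    /** Stand-in for Flink's MetricGroup, reduced to a name for this demo. */
    interface MetricGroup { String name(); }

    /** Stand-in for Flink's FileSystemFactory; only the scheme is modelled. */
    interface FileSystemFactory { String getScheme(); }

    /** Opt-in interface as suggested in the thread. */
    interface MetricsAware { void setMetricGroup(MetricGroup metricGroup); }

    /** A factory that ignores metrics and therefore needs no changes. */
    static final class PlainFactory implements FileSystemFactory {
        @Override public String getScheme() { return "file"; }
    }

    /** A factory that opts in by also implementing MetricsAware. */
    static final class S3Factory implements FileSystemFactory, MetricsAware {
        MetricGroup metricGroup;
        @Override public String getScheme() { return "s3"; }
        @Override public void setMetricGroup(MetricGroup metricGroup) {
            this.metricGroup = metricGroup;
        }
    }

    /** Mirrors the suggested loop; returns the schemes that received a group. */
    static List<String> injectMetricGroup(
            List<FileSystemFactory> factories, MetricGroup metricGroup) {
        List<String> injected = new ArrayList<>();
        for (FileSystemFactory factory : factories) {
            if (factory instanceof MetricsAware) {
                ((MetricsAware) factory).setMetricGroup(metricGroup);
                injected.add(factory.getScheme());
            }
        }
        return injected;
    }

    public static void main(String[] args) {
        MetricGroup group = () -> "filesystem";
        List<FileSystemFactory> factories =
                Arrays.asList(new PlainFactory(), new S3Factory());
        System.out.println(injectMetricGroup(factories, group));
    }
}
```

Only the factory that implements MetricsAware is touched, which is the property
that lets existing third-party factories compile and run unchanged.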
>
> @Aleksandr
> If opt-in means "s3.metrics.enabled" defaults to "false", I'd say that's
> not the way to go.
> Observability features that require pre-incident configuration tend to
> never get enabled,
> which directly defeats the FLIP's stated goal of closing the operational
> blindness gap.
>
> The concern about cardinality is legitimate, but the math is favorable:
> these ~50 series are at
> TM scope, not subtask scope. A 100-TM cluster adds roughly 5,000 series
> which is modest
> compared to what operator-level metrics already emit.
>
> The right answer is informed default-on with a clear escape hatch. The FLIP
> already has
> the split between basic (default-on, bounded cardinality) and detailed
> (opt-in via "s3.metrics.detailed.enabled").
> Teams with strict cardinality budgets can also suppress the entire group at
> the reporter level with a single line:
> metrics.reporter.<name>.filter.excludes = *.filesystem.*:*:*
>
> During performance testing we intend to measure things in depth, and if
> something blows up then fine-tuning is still a possibility during PR review.
>
> G
>
>
> On Thu, May 21, 2026 at 6:12 PM Aleksandr Iushmanov <[email protected]>
> wrote:
>
> > Hi Samrat,
> >
> > Thank you for putting it together. I believe that this is a good addition
> > to ensure that Flink is operation ready.
> >
> > The proposal overall looks good to me, but I have a concern around the
> > number of metrics we enable by default. As you have mentioned in the doc,
> > the number of added time series is ~50. I have a feeling that enabling
> > them by default may lead to unpleasant surprises in terms of extra
> > cardinality and the volume of exported data unless it is guarded through
> > allowlists. My personal preference would be to keep this option opt-in.
> >
> > Please let me know your thoughts on this.
> >
> > Kind regards,
> > Alex
> >
> >
> > On Tue, 5 May 2026 at 10:58, Samrat Deb <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > I'd like to open a discussion on FLIP-576: Filesystem-Plugin
> > > Observability for (flink-s3-fs-native)[1].
> > >
> > > Apache Flink’s filesystem layer is critical to core operations like
> > > checkpoints, savepoints, and state access, most of which rely heavily
> > > on S3. Despite this, the current observability between S3 and Flink
> > > offers little insight into underlying issues. Engineers lack visibility
> > > into key failure signals, including S3 throttling, retry behaviour, slow
> > > operations, load distribution, multipart upload leaks, and intermittent
> > > stream failures. As a result, diagnosing production issues often requires
> > > manual correlation across logs and external systems, making
> > > troubleshooting slow and unreliable. This observability gap significantly
> > > impacts the operability of Flink in real-world large-scale deployments.
> > > This FLIP addresses that gap and builds support for the native S3 FS.
> > >
> > > Looking forward to your feedback.
> > >
> > > Bests,
> > > Samrat
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
> > >
> >
>
