Thank you all.

I have started a voting thread for the FLIP.

Bests,
Samrat

On Thu, May 28, 2026 at 6:48 PM Gabor Somogyi <[email protected]>
wrote:

> Hi Samrat,
>
> I've checked it and it looks good from my side.
>
> BR,
> G
>
>
> On Thu, May 28, 2026 at 10:06 AM Aleksandr Iushmanov <[email protected]>
> wrote:
>
> > Thank you Samrat,
> >
> > Looks good to me!
> >
> > Kind regards,
> > Alex
> >
> >
> > On Wed, 27 May 2026 at 17:25, Samrat Deb <[email protected]> wrote:
> >
> > > Hi Aleksandr Iushmanov,
> > >
> > > > The proposal overall looks good to me, but I have a concern around the
> > > > number of metrics we enable by default. As you have mentioned in the doc,
> > > > the number of added time series is ~50. I have a feeling that enabling them
> > > > by default may lead to unpleasant surprises in terms of extra cardinality
> > > > and the volume of exported data unless it is guarded through allowlists. My
> > > > personal preference would be to keep this option opt-in.
> > >
> > > Thank you for the suggestion. Making this opt-in makes sense: it lets users
> > > decide the metric cardinality for their own setup.
> > > Here is my plan for the changes to the FLIP:
> > >
> > >   s3.metrics.enabled: true
> > >   s3.metrics.allowlist:
> > >     - api_call_count
> > >     - api_call_duration_ms
> > >     - throttle_count
> > >     - retry_count
> > >     - iops
> > >     - mpu_aborted_total
> > >   s3.metrics.detailed.enabled: false
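
A minimal sketch of how the proposed keys could gate metric registration. It assumes the allowlist is an exact-match set of metric names and that disabling `s3.metrics.enabled` suppresses everything; neither is confirmed by the thread. The class `MetricAllowlist` and its method are hypothetical and are not part of the FLIP or of Flink.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not a Flink class: decides whether a metric gets registered.
final class MetricAllowlist {
    private final boolean enabled;      // assumed to mirror s3.metrics.enabled
    private final Set<String> allowed;  // assumed to mirror s3.metrics.allowlist

    MetricAllowlist(boolean enabled, Set<String> allowed) {
        this.enabled = enabled;
        this.allowed = new HashSet<>(allowed); // defensive copy
    }

    // Register only when metrics are switched on and the name is listed (exact match).
    boolean shouldRegister(String metricName) {
        return enabled && allowed.contains(metricName);
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList(
                "api_call_count", "api_call_duration_ms", "throttle_count",
                "retry_count", "iops", "mpu_aborted_total"));
        MetricAllowlist allowlist = new MetricAllowlist(true, names);
        System.out.println(allowlist.shouldRegister("throttle_count"));    // true
        System.out.println(allowlist.shouldRegister("some_other_metric")); // false
    }
}
```

With this shape, the six names listed above bound the exported series regardless of how many metrics the plugin could emit.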
> > >
> > >
> > > Best,
> > > Samrat
> > >
> > >
> > >
> > > On Fri, May 22, 2026 at 5:26 PM Gabor Somogyi <[email protected]> wrote:
> > >
> > > > @Samrat
> > > > Thanks for the detailed explanation for the metrics usage.
> > > >
> > > > Throttling is not supported by the current implementation, even though
> > > > we plan to add metrics for it. That is fine, however, since I'm about to
> > > > add throttling support soon.
> > > >
> > > > ------------
> > > >
> > > > One small API refinement worth considering: instead of adding a second
> > > > "configure(Configuration, MetricGroup)" overload to FileSystemFactory,
> > > > introduce a separate opt-in interface:
> > > >
> > > > public interface MetricsAware {
> > > >     void setMetricGroup(MetricGroup metricGroup);
> > > > }
> > > >
> > > > Then inside FileSystem.initialize():
> > > > for (FileSystemFactory factory : factories) {
> > > >     if (factory instanceof MetricsAware) {
> > > >         ((MetricsAware) factory).setMetricGroup(metricGroup);
> > > >     }
> > > > }
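
The two snippets above can be combined into a self-contained sketch. The `MetricGroup` and `FileSystemFactory` types here are simplified stand-ins for Flink's interfaces of the same names, and `S3Factory`, `LegacyFactory`, and `injectMetricGroup` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class MetricsAwareSketch {

    // Stand-in for org.apache.flink.metrics.MetricGroup; the real interface is much larger.
    interface MetricGroup {}

    // Stand-in for Flink's FileSystemFactory; only the scheme is modelled here.
    interface FileSystemFactory {
        String getScheme();
    }

    // The opt-in interface proposed in the thread.
    interface MetricsAware {
        void setMetricGroup(MetricGroup metricGroup);
    }

    // Hypothetical factory that opts in to metrics.
    static final class S3Factory implements FileSystemFactory, MetricsAware {
        MetricGroup metricGroup;
        @Override public String getScheme() { return "s3"; }
        @Override public void setMetricGroup(MetricGroup metricGroup) { this.metricGroup = metricGroup; }
    }

    // Hypothetical third-party factory that needs no changes.
    static final class LegacyFactory implements FileSystemFactory {
        @Override public String getScheme() { return "legacy"; }
    }

    // Mirrors the loop from the thread; returns how many factories received the group.
    static int injectMetricGroup(List<FileSystemFactory> factories, MetricGroup metricGroup) {
        int injected = 0;
        for (FileSystemFactory factory : factories) {
            if (factory instanceof MetricsAware) {
                ((MetricsAware) factory).setMetricGroup(metricGroup);
                injected++;
            }
        }
        return injected;
    }

    public static void main(String[] args) {
        MetricGroup group = new MetricGroup() {};
        List<FileSystemFactory> factories = Arrays.asList(new S3Factory(), new LegacyFactory());
        System.out.println(injectMetricGroup(factories, group)); // 1
    }
}
```

Only the factory that implements `MetricsAware` is touched; the other one compiles and runs without any change.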
> > > >
> > > > This keeps FileSystemFactory's contract unchanged: third-party
> > > > implementations need zero modifications unless they want metrics. The
> > > > FLIP's default-on collection is fine; this is purely an interface hygiene
> > > > suggestion.
> > > >
> > > > @Aleksandr
> > > > If opt-in means "s3.metrics.enabled" defaults to "false", I'd say that's
> > > > not the way to go.
> > > > Observability features that require pre-incident configuration tend to
> > > > never get enabled, which directly defeats the FLIP's stated goal of
> > > > closing the operational blindness gap.
> > > >
> > > > The concern about cardinality is legitimate, but the math is favorable:
> > > > these ~50 series are at TM scope, not subtask scope. A 100-TM cluster adds
> > > > roughly 5,000 series, which is modest compared to what operator-level
> > > > metrics already emit.
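
The arithmetic behind the estimate above, as a small sketch. The subtask-scoped comparison uses an assumed figure of 8 subtasks per TaskManager, which does not come from the thread; `SeriesEstimate` is a hypothetical name.

```java
// Back-of-the-envelope estimate of added time series; not Flink code.
final class SeriesEstimate {

    // TM-scoped metrics are emitted once per TaskManager.
    static long tmScoped(int seriesPerTaskManager, int taskManagers) {
        return (long) seriesPerTaskManager * taskManagers;
    }

    // Subtask-scoped metrics would also multiply by the subtasks on each TaskManager.
    static long subtaskScoped(int seriesPerSubtask, int taskManagers, int subtasksPerTaskManager) {
        return (long) seriesPerSubtask * taskManagers * subtasksPerTaskManager;
    }

    public static void main(String[] args) {
        System.out.println(tmScoped(50, 100));         // 5000
        System.out.println(subtaskScoped(50, 100, 8)); // 40000 (8 subtasks per TM is assumed)
    }
}
```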
> > > >
> > > > The right answer is informed default-on with a clear escape hatch. The FLIP
> > > > already has the split between basic (default-on, bounded cardinality) and
> > > > detailed (opt-in via "s3.metrics.detailed.enabled").
> > > > Teams with strict cardinality budgets can also suppress the entire group at
> > > > the reporter level with a single line:
> > > > metrics.reporter.<name>.filter.excludes = *.filesystem.*:*:*
> > > >
> > > > During performance testing we intend to measure things in depth, and if
> > > > something blows up, fine-tuning is still a possibility during PR review.
> > > >
> > > > G
> > > >
> > > >
> > > > On Thu, May 21, 2026 at 6:12 PM Aleksandr Iushmanov <[email protected]> wrote:
> > > >
> > > > > Hi Samrat,
> > > > >
> > > > > Thank you for putting it together. I believe that this is a good addition
> > > > > to ensure that Flink is operationally ready.
> > > > >
> > > > > The proposal overall looks good to me, but I have a concern around the
> > > > > number of metrics we enable by default. As you have mentioned in the doc,
> > > > > the number of added time series is ~50. I have a feeling that enabling them
> > > > > by default may lead to unpleasant surprises in terms of extra cardinality
> > > > > and the volume of exported data unless it is guarded through allowlists. My
> > > > > personal preference would be to keep this option opt-in.
> > > > >
> > > > > Please let me know your thoughts on this.
> > > > >
> > > > > Kind regards,
> > > > > Alex
> > > > >
> > > > >
> > > > > On Tue, 5 May 2026 at 10:58, Samrat Deb <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I'd like to open a discussion on FLIP-576: Filesystem-Plugin Observability
> > > > > > for (flink-s3-fs-native)[1].
> > > > > >
> > > > > > Apache Flink’s filesystem layer is critical to core operations like
> > > > > > checkpoints, savepoints, and state access, most of which rely heavily on
> > > > > > S3. Despite this, the current S3 observability in Flink offers little
> > > > > > insight into underlying issues. Engineers lack visibility into key failure
> > > > > > signals, including S3 throttling, retry behaviour, slow operations, load
> > > > > > distribution, multipart upload leaks, and intermittent stream failures. As
> > > > > > a result, diagnosing production issues often requires manual correlation
> > > > > > across logs and external systems, making troubleshooting slow and
> > > > > > unreliable. This observability gap significantly impacts the operability of
> > > > > > Flink in real-world large-scale deployments.
> > > > > > This FLIP addresses that gap by adding observability support for the
> > > > > > native S3 FS.
> > > > > >
> > > > > > Looking forward to your feedback.
> > > > > >
> > > > > > Bests,
> > > > > > Samrat
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
>
