Re: [DISCUSS] FLIP-576 Filesystem-Plugin Observability (flink-s3-fs-native)

Samrat Deb Wed, 27 May 2026 09:35:53 -0700

  Hi Gabor,



  Thanks for the careful read and for the interface suggestion. I started


  writing this up as "interface hygiene" and ended up convincing myself


  The change matters for correctness, not just ergonomics.
  I'd like to take it further than your framing.


  > instead of adding a second "configure(Configuration, MetricGroup)"


  > overload to FileSystemFactory, introduce a separate opt-in interface.


  Reason it's more than hygiene: at every FileSystem.initialize caller I


  traced, the runtime ordering is





    TaskManagerRunner.runTaskManagerProcessSecurely


        FileSystem.initialize(config, pluginManager)


        ...


        runTaskManager(...)


            new MetricRegistryImpl(...)





    ClusterEntrypoint


        configureFileSystems -> FileSystem.initialize


        ...


        initializeServices


            createMetricRegistry(...)





  So, at the point FileSystem.initialize fires, MetricRegistry does not


  yet exist. The configure(Configuration, MetricGroup) overload as


  currently drafted in the FLIP, would either have to pass an


  UnregisteredMetricsGroup sentinel (and silently re-register later) or
  force a startup re-ordering across every entrypoint that calls


  FileSystem.initialize TM, JM, CLI, HistoryServer, YARN, k8s, SQL


  gateway. Neither is great.





  Your MetricsAware shape sidesteps this cleanly because it expects


  two-phase init by construction. Concretely, I'd like to land it as:



    - org.apache.flink.core.plugin.MetricsAware (sibling of Plugin):





          public interface MetricsAware {


              void setMetricGroup(MetricGroup metricGroup);
          }



    - FileSystem.initialize(Configuration, PluginManager) unchanged.



    - new FileSystem.attachMetrics(MetricGroup processMetricGroup):


      idempotent; iterates registered factories and calls setMetricGroup
      on the MetricsAware ones. Invoked from
TaskManagerRunner.runTaskManager

      and ClusterEntrypoint.initializeServices, after MetricRegistry is up.





    - Entrypoints with no process MetricGroup (CLI, HistoryServer, YARN


      client, etc.) don't get metrics. That's the correct behaviour for


      those contexts and matches today.





  Contract pinned on MetricsAware so two-phase init isn't ambiguous:


  called at most once, after Plugin.configure(Configuration), before any


  operation that would emit a metric. If never called, the plugin must


  operate normally and emit nothing.



  Placing the marker in o.a.f.core.plugin (not under o.a.f.core.fs)


  keeps it reusable. StateChangelogStorage and DelegationTokenProvider
  have the same gap today and could adopt it later without another SPI


  change each time.



  > The FLIP's default-on collection is fine; this is purely an interface


  > hygiene suggestion.



  On default-on, FYI for the thread: Roman, Alex and I converged offline


  on a hybrid that I think respects both your point and Alex's. Master
  switch defaults to true; an allowlist (s3.metrics.allowlist) defaults


  to five metrics (api_call_count, api_call_duration_ms, throttle_count,


  retry_count, iops) and everything else is opt-in. Detailed mode


  (bucket label, request-size histogram) stays opt-in as today.


  Cardinality with the default allowlist works out to ~20-25 series per


  TM instead of ~50, which should land below most reporter budgets while


  still answering the working set of questions out of the box. I have
folded

  that into the FLIP revision.





  > Throttling is not supported by the actual implementation even though


  > we plan to add metrics for it. It's good to go however, I'm about to
  > add throttling support soon.



  Thanks for the heads-up.

   I have updated the FLIP accordingly
   1. Separate interface
   2. Added allowList metrics support

  Best,


  Samrat

On Wed, May 27, 2026 at 9:55 PM Samrat Deb <[email protected]> wrote:

> Hi Aleksandr Iushmanov,
>
> > The proposal overall looks good to me, but I have a concern around the
> > number of metrics we enable by default. As you have mentioned in the doc,
> > the number of added time series is ~50. I have a feeling that enabling
> them
> > by default may lead to unpleasant surprises in terms of extra cardinality
> > and the volume of exported data unless it is guarded through allowlists.
> My
> > personal preference would be to keep this option opt-in.
>
> Thank you for the suggestion. The opt-in makes sense. It would allow users
> to decide the cardinality of metrics within their setup.
> Here is my plan to add changes to the flip
>
>   s3.metrics.enabled: true
>
>
>   s3.metrics.allowlist:
>      - api_call_count
>
>
>      - api_call_duration_ms
>
>
>      - throttle_count
>
>
>      - retry_count
>
>
>      - iops
>
>
>      - mpu_aborted_total
>  s3.metrics.detailed.enabled: false
>
>
> Best,
> Samrat
>
>
>
> On Fri, May 22, 2026 at 5:26 PM Gabor Somogyi <[email protected]>
> wrote:
>
>> @Samrat
>> Thanks for the detailed explanation for the metrics usage.
>>
>> Throttling is not supported by the actual implementation even though
>> we plan to add metrics for it. It's good to go however, I'm about to add
>> throttling support soon.
>>
>> ------------
>>
>> One small API refinement worth considering: instead of adding a second
>> "configure(Configuration, MetricGroup)"
>> overload toFileSystemFactory, introduce a separate opt-in interface:
>>
>> public interface MetricsAware {
>>     void setMetricGroup(MetricGroup metricGroup);
>> }
>>
>> Then inside FileSystem.initialize():
>> for (FileSystemFactory factory : factories) {
>>     if (factory instanceof MetricsAware) {
>>         ((MetricsAware) factory).setMetricGroup(metricGroup);
>>     }
>> }
>>
>> This keeps FileSystemFactory's contract unchanged, third-party
>> implementations need zero
>> modifications unless they want metrics. The FLIP's default-on collection
>> is
>> fine; this is purely an interface hygiene suggestion.
>>
>> @Aleksandr
>> If opt-in means "s3.metrics.enabled" defaults to "false", I'd say that's
>> not the way to go.
>> Observability features that require pre-incident configuration tend to
>> never get enabled,
>> which directly defeats the FLIP's stated goal of closing the operational
>> blindness gap.
>>
>> The concern about cardinality is legitimate, but the math is favorable:
>> these ~50 series are at
>> TM scope, not subtask scope. A 100-TM cluster adds roughly 5,000 series
>> which is modest
>> compared to what operator-level metrics already emit.
>>
>> The right answer is informed default-on with a clear escape hatch. The
>> FLIP
>> already has
>> the split between basic (default-on, bounded cardinality) and detailed
>> (opt-in via "s3.metrics.detailed.enabled").
>> Teams with strict cardinality budgets can also suppress the entire group
>> at
>> the reporter level with a single line:
>> metrics.reporter.<name>.filter.excludes = *.filesystem.*:*:*
>>
>> During performance testing we're intended to measure things in-depth and
>> if
>> something
>> blows up then fine tuning is still a possibilty during PR review.
>>
>> G
>>
>>
>> On Thu, May 21, 2026 at 6:12 PM Aleksandr Iushmanov <[email protected]>
>> wrote:
>>
>> > Hi Samrat,
>> >
>> > Thank you for putting it together. I believe that this is a good
>> addition
>> > to ensure that Flink is operation ready.
>> >
>> > The proposal overall looks good to me, but I have a concern around the
>> > number of metrics we enable by default. As you have mentioned in the
>> doc,
>> > the number of added time series is ~50. I have a feeling that enabling
>> them
>> > by default may lead to unpleasant surprises in terms of extra
>> cardinality
>> > and the volume of exported data unless it is guarded through
>> allowlists. My
>> > personal preference would be to keep this option opt-in.
>> >
>> > Please let me know your thoughts on this.
>> >
>> > Kind regards,
>> > Alex
>> >
>> >
>> > On Tue, 5 May 2026 at 10:58, Samrat Deb <[email protected]> wrote:
>> >
>> > > Hi All,
>> > >
>> > > I'd like to open a discussion on FLIP-576: Filesystem-Plugin
>> > Observability
>> > > for (flink-s3-fs-native)[1].
>> > >
>> > > Apache Flink’s filesystem layer is critical to core operations like
>> > > checkpoints, savepoints, and state access. Most of which rely heavily
>> on
>> > > S3. Despite this, the current observability in s3<>flink is offering
>> > little
>> > > insight into underlying issues. Engineers lack visibility into key
>> > failure
>> > > signals, including S3 throttling, retry behaviour, slow operations,
>> load
>> > > distribution, multipart upload leaks, and intermittent stream
>> failures.
>> > As
>> > > a result, diagnosing production issues often requires manual
>> correlation
>> > > across logs and external systems, making troubleshooting slow and
>> > > unreliable. This observability gap significantly impacts the
>> operability
>> > of
>> > > Flink in real-world large-scale deployments.
>> > > This FLIP proposal addresses the same and builds support for native S3
>> > FS.
>> > >
>> > > Looking forward to your feedback.
>> > >
>> > > Bests,
>> > > Samrat
>> > >
>> > > [1]
>> > >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-576 Filesystem-Plugin Observability (flink-s3-fs-native)

Reply via email to