Hi Samrat,

I've checked it and it looks good from my side.

BR,
G

On Thu, May 28, 2026 at 10:06 AM Aleksandr Iushmanov <[email protected]> wrote:

> Thank you Samrat,
>
> Looks good to me!
>
> Kind regards,
> Alex
>
>
> On Wed, 27 May 2026 at 17:25, Samrat Deb <[email protected]> wrote:
>
> > Hi Aleksandr Iushmanov,
> >
> > > The proposal overall looks good to me, but I have a concern around the
> > > number of metrics we enable by default. As you have mentioned in the
> > > doc, the number of added time series is ~50. I have a feeling that
> > > enabling them by default may lead to unpleasant surprises in terms of
> > > extra cardinality and the volume of exported data unless it is guarded
> > > through allowlists. My personal preference would be to keep this
> > > option opt-in.
> >
> > Thank you for the suggestion. The opt-in makes sense. It would allow
> > users to decide the cardinality of metrics within their setup.
> > Here is my plan for the changes to the FLIP:
> >
> > s3.metrics.enabled: true
> >
> > s3.metrics.allowlist:
> >   - api_call_count
> >   - api_call_duration_ms
> >   - throttle_count
> >   - retry_count
> >   - iops
> >   - mpu_aborted_total
> >
> > s3.metrics.detailed.enabled: false
> >
> > Best,
> > Samrat
> >
> >
> > On Fri, May 22, 2026 at 5:26 PM Gabor Somogyi <[email protected]> wrote:
> >
> > > @Samrat
> > > Thanks for the detailed explanation of the metrics usage.
> > >
> > > Throttling is not supported by the current implementation even though
> > > we plan to add metrics for it. It's good to go; however, I'm about to
> > > add throttling support soon.
> > >
> > > ------------
> > >
> > > One small API refinement worth considering: instead of adding a second
> > > "configure(Configuration, MetricGroup)" overload to FileSystemFactory,
> > > introduce a separate opt-in interface:
> > >
> > > public interface MetricsAware {
> > >     void setMetricGroup(MetricGroup metricGroup);
> > > }
> > >
> > > Then inside FileSystem.initialize():
> > >
> > > for (FileSystemFactory factory : factories) {
> > >     if (factory instanceof MetricsAware) {
> > >         ((MetricsAware) factory).setMetricGroup(metricGroup);
> > >     }
> > > }
> > >
> > > This keeps FileSystemFactory's contract unchanged: third-party
> > > implementations need zero modifications unless they want metrics.
> > > The FLIP's default-on collection is fine; this is purely an interface
> > > hygiene suggestion.
> > >
> > > @Aleksandr
> > > If opt-in means "s3.metrics.enabled" defaults to "false", I'd say
> > > that's not the way to go.
> > > Observability features that require pre-incident configuration tend to
> > > never get enabled, which directly defeats the FLIP's stated goal of
> > > closing the operational blindness gap.
> > >
> > > The concern about cardinality is legitimate, but the math is
> > > favorable: these ~50 series are at TM scope, not subtask scope.
> > > A 100-TM cluster adds roughly 5,000 series (100 x ~50), which is
> > > modest compared to what operator-level metrics already emit.
> > >
> > > The right answer is informed default-on with a clear escape hatch.
> > > The FLIP already has the split between basic (default-on, bounded
> > > cardinality) and detailed (opt-in via "s3.metrics.detailed.enabled").
> > > Teams with strict cardinality budgets can also suppress the entire
> > > group at the reporter level with a single line:
> > >
> > > metrics.reporter.<name>.filter.excludes = *.filesystem.*:*:*
> > >
> > > During performance testing we intend to measure things in depth, and
> > > if something blows up, fine-tuning is still a possibility during PR
> > > review.
> > >
> > > G
> > >
> > >
> > > On Thu, May 21, 2026 at 6:12 PM Aleksandr Iushmanov <[email protected]>
> > > wrote:
> > >
> > > > Hi Samrat,
> > > >
> > > > Thank you for putting it together. I believe that this is a good
> > > > addition to ensure that Flink is operationally ready.
> > > >
> > > > The proposal overall looks good to me, but I have a concern around
> > > > the number of metrics we enable by default. As you have mentioned in
> > > > the doc, the number of added time series is ~50. I have a feeling
> > > > that enabling them by default may lead to unpleasant surprises in
> > > > terms of extra cardinality and the volume of exported data unless it
> > > > is guarded through allowlists. My personal preference would be to
> > > > keep this option opt-in.
> > > >
> > > > Please let me know your thoughts on this.
> > > >
> > > > Kind regards,
> > > > Alex
> > > >
> > > >
> > > > On Tue, 5 May 2026 at 10:58, Samrat Deb <[email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I'd like to open a discussion on FLIP-576: Filesystem-Plugin
> > > > > Observability for (flink-s3-fs-native)[1].
> > > > >
> > > > > Apache Flink’s filesystem layer is critical to core operations like
> > > > > checkpoints, savepoints, and state access, most of which rely
> > > > > heavily on S3. Despite this, the current observability between S3
> > > > > and Flink offers little insight into underlying issues.
> > > > > Engineers lack visibility into key failure signals, including S3
> > > > > throttling, retry behaviour, slow operations, load distribution,
> > > > > multipart upload leaks, and intermittent stream failures. As a
> > > > > result, diagnosing production issues often requires manual
> > > > > correlation across logs and external systems, making
> > > > > troubleshooting slow and unreliable. This observability gap
> > > > > significantly impacts the operability of Flink in real-world
> > > > > large-scale deployments.
> > > > > This FLIP addresses that gap and builds this support for the
> > > > > native S3 FS.
> > > > >
> > > > > Looking forward to your feedback.
> > > > >
> > > > > Bests,
> > > > > Samrat
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957173
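
For reference, the opt-in interface suggested in the thread (a MetricsAware interface plus an instanceof check during initialization) could be sketched roughly as below. This is only an illustration under stated assumptions: MetricGroup and FileSystemFactory here are minimal stand-ins rather than Flink's real types, and LegacyFactory, S3Factory, and injectMetricGroup are hypothetical names that do not come from the FLIP.

```java
import java.util.Arrays;
import java.util.List;

public class MetricsAwareSketch {

    /** Stand-in for Flink's MetricGroup (assumption: reduced to a name only). */
    interface MetricGroup {
        String name();
    }

    /** Stand-in for FileSystemFactory; its contract gains no metrics method. */
    interface FileSystemFactory {
        String getScheme();
    }

    /** Opt-in interface as proposed in the thread. */
    interface MetricsAware {
        void setMetricGroup(MetricGroup metricGroup);
    }

    /** Hypothetical third-party factory that ignores metrics entirely. */
    static final class LegacyFactory implements FileSystemFactory {
        @Override
        public String getScheme() {
            return "file";
        }
    }

    /** Hypothetical factory that opts in by also implementing MetricsAware. */
    static final class S3Factory implements FileSystemFactory, MetricsAware {
        private MetricGroup metricGroup;

        @Override
        public String getScheme() {
            return "s3";
        }

        @Override
        public void setMetricGroup(MetricGroup metricGroup) {
            this.metricGroup = metricGroup;
        }

        MetricGroup metricGroup() {
            return metricGroup;
        }
    }

    /**
     * Mirrors the loop from the thread: only factories that implement
     * MetricsAware receive the group. Returns how many factories were injected.
     */
    static int injectMetricGroup(
            List<? extends FileSystemFactory> factories, MetricGroup group) {
        int injected = 0;
        for (FileSystemFactory factory : factories) {
            if (factory instanceof MetricsAware) {
                ((MetricsAware) factory).setMetricGroup(group);
                injected++;
            }
        }
        return injected;
    }

    public static void main(String[] args) {
        MetricGroup group = () -> "filesystem";
        List<FileSystemFactory> factories =
                Arrays.asList(new LegacyFactory(), new S3Factory());
        int injected = injectMetricGroup(factories, group);
        System.out.println(
                "injected into " + injected + " of " + factories.size() + " factories");
    }
}
```

In this sketch LegacyFactory compiles and runs without any change, which is the point of the suggestion: only factories that want metrics need to implement the extra interface.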
