[
https://issues.apache.org/jira/browse/FLINK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-39546:
-----------------------------------
Labels: pull-request-available (was: )
> Improve observability in flink-s3-fs-native by exposing operation-level S3
> metrics
> ----------------------------------------------------------------------------------
>
> Key: FLINK-39546
> URL: https://issues.apache.org/jira/browse/FLINK-39546
> Project: Flink
> Issue Type: New Feature
> Components: Connectors / FileSystem
> Affects Versions: 2.3.0
> Reporter: Samrat Deb
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.4.0
>
>
> flink-s3-fs-native currently exposes only coarse IO counters. Operators cannot see
> per-operation latency, S3 throttling (HTTP 503 SlowDown), retry counts and reasons,
> multipart-upload lifecycle, stream reopens, or connection-pool saturation through
> Flink's metric system. When checkpoint duration regresses in production, there is
> no Flink signal to attribute the cause to S3, the network, or the state backend.
> Diagnosing such incidents today requires correlating Flink logs with AWS CloudTrail
> or capturing packets; neither scales as a routine operational practice.
> This ticket proposes bridging AWS SDK v2's built-in MetricPublisher into Flink's
> MetricGroup from inside flink-s3-fs-native, and adding a small set of plugin-specific
> metrics that the SDK cannot see (NativeS3InputStream reopens, RecoverableWriter /
> multipart-upload lifecycle).
>
> *Why is this targeted at flink-s3-fs-native specifically?*
> flink-s3-fs-native owns its S3AsyncClient directly and can therefore attach an
> AWS SDK v2 MetricPublisher at client construction. The same approach is not
> available to flink-s3-fs-hadoop and flink-s3-fs-presto, because both delegate to
> a Hadoop-owned filesystem (org.apache.hadoop.fs.s3a.S3AFileSystem) or the Presto
> equivalent, each of which constructs and owns the S3 client internally.
> Hadoop S3A exposes its own IOStatistics framework, but that framework is not
> available via flink-s3-fs-hadoop, nor directly via the AWS SDK v2 MetricPublisher.
> Surfacing those statistics in Flink would require a separate adapter and a Hadoop
> version floor, and would be coupled to S3A internals that change across Hadoop
> releases.
>
> Doing this work in flink-s3-fs-native has the cleanest dependency footprint
> and the lowest classpath risk.
> cc: [~gsomogyi]
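The bridge described in the ticket can be illustrated with a minimal, self-contained Java sketch. It deliberately does not use the real AWS SDK v2 or Flink types: ApiCallSample, OperationMetrics, and S3MetricBridge are hypothetical stand-ins for an SDK metric record, a Flink MetricGroup, and the MetricPublisher implementation, and the metric names are invented for illustration. Treating every HTTP 503 as throttling is also a simplifying assumption, not something the ticket specifies.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the per-call data an SDK metric publisher receives. */
final class ApiCallSample {
    final String operation;   // e.g. "PutObject"
    final Duration latency;   // end-to-end duration of the API call
    final int httpStatus;     // final HTTP status code of the call
    final int retries;        // number of retries the call needed

    ApiCallSample(String operation, Duration latency, int httpStatus, int retries) {
        this.operation = operation;
        this.latency = latency;
        this.httpStatus = httpStatus;
        this.retries = retries;
    }
}

/** Hypothetical stand-in for a Flink metric group: a thread-safe bag of named counters. */
final class OperationMetrics {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    void add(String name, long delta) {
        counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
    }

    long get(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0L : adder.sum();
    }
}

/** Hypothetical bridge: turns each per-call sample into operation-level counters. */
final class S3MetricBridge {
    private static final int HTTP_SERVICE_UNAVAILABLE = 503;
    private final OperationMetrics metrics;

    S3MetricBridge(OperationMetrics metrics) {
        this.metrics = metrics;
    }

    void publish(ApiCallSample sample) {
        String prefix = "s3." + sample.operation;
        metrics.add(prefix + ".calls", 1);
        metrics.add(prefix + ".latencyMillisTotal", sample.latency.toMillis());
        metrics.add(prefix + ".retries", sample.retries);
        // Assumption: every 503 is counted as throttling (SlowDown).
        if (sample.httpStatus == HTTP_SERVICE_UNAVAILABLE) {
            metrics.add(prefix + ".throttled", 1);
        }
    }
}
```

In the actual plugin, the publish step would presumably be driven by the SDK's metric callbacks and the counters registered on a Flink MetricGroup; the sketch only shows the per-operation aggregation that makes S3 latency, retries, and throttling attributable.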
--
This message was sent by Atlassian Jira
(v8.20.10#820010)