[ 
https://issues.apache.org/jira/browse/FLINK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39546:
-----------------------------------
    Labels: pull-request-available  (was: )

> Improve observability in flink-s3-fs-native by exposing operation-level S3 
> metrics
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-39546
>                 URL: https://issues.apache.org/jira/browse/FLINK-39546
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / FileSystem
>    Affects Versions: 2.3.0
>            Reporter: Samrat Deb
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.4.0
>
>
> flink-s3-fs-native currently exposes only coarse IO counters. Through Flink's
> metric system, operators cannot see per-operation latency, S3 throttling
> (HTTP 503 SlowDown), retry counts and reasons, multipart-upload lifecycle,
> stream reopens, or connection-pool saturation. When checkpoint duration
> regresses in production, there is no Flink signal to attribute the cause to
> S3, the network, or the state backend. Diagnosing such incidents today
> requires correlating Flink logs with AWS CloudTrail or capturing packets,
> neither of which scales as a routine operational practice.
> This ticket proposes to bridge AWS SDK v2's built-in MetricPublisher into
> Flink's MetricGroup from inside flink-s3-fs-native, plus a small set of
> plugin-specific metrics that the SDK cannot see (NativeS3InputStream reopens,
> RecoverableWriter / multipart-upload lifecycle).
>  
> *Why is this targeted at flink-s3-fs-native specifically?* 
> flink-s3-fs-native owns its S3AsyncClient directly and can therefore attach an
> AWS SDK v2 MetricPublisher at client construction. The same approach is not
> available to flink-s3-fs-hadoop and flink-s3-fs-presto, because both delegate
> to a Hadoop-owned filesystem (org.apache.hadoop.fs.s3a.S3AFileSystem and the
> Presto equivalent), which constructs and owns the S3 client internally.
> Hadoop S3A exposes its own IOStatistics framework, but it is available neither
> via flink-s3-fs-hadoop nor directly via the AWS SDK v2 MetricPublisher.
> Surfacing those statistics in Flink would require a separate adapter and a
> Hadoop version floor, and would be coupled to S3A internals that change across
> Hadoop releases.
>  
> Doing this work in flink-s3-fs-native has the cleanest dependency footprint
> and the lowest classpath risk.
> cc: [~gsomogyi] 
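
The bridge described in the ticket can be illustrated with a minimal,
dependency-free sketch. It is not the proposed implementation: S3CallRecord,
CounterRegistry, S3MetricsBridge, and all metric names below are hypothetical
stand-ins (for the per-call data the SDK reports, for a Flink MetricGroup, and
for the publisher itself), chosen so that the example compiles without the AWS
SDK or Flink on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the data the SDK reports about one completed S3 call. */
final class S3CallRecord {
    final String operation;     // e.g. "PutObject"
    final long durationMillis;  // end-to-end call latency
    final int retryCount;       // retries performed by the SDK
    final int httpStatus;       // final HTTP status code

    S3CallRecord(String operation, long durationMillis, int retryCount, int httpStatus) {
        this.operation = operation;
        this.durationMillis = durationMillis;
        this.retryCount = retryCount;
        this.httpStatus = httpStatus;
    }
}

/** Hypothetical stand-in for a Flink MetricGroup: named, thread-safe counters. */
final class CounterRegistry {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    void add(String name, long delta) {
        counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
    }

    long get(String name) {
        LongAdder counter = counters.get(name);
        return counter == null ? 0L : counter.sum();
    }
}

/** Translates per-call records into operation-scoped counters. */
final class S3MetricsBridge {
    private static final int HTTP_SLOW_DOWN = 503; // S3 throttling response

    private final CounterRegistry registry;

    S3MetricsBridge(CounterRegistry registry) {
        this.registry = registry;
    }

    /** Called once per completed S3 call; safe to call from multiple threads. */
    void publish(S3CallRecord call) {
        String prefix = "s3." + call.operation + ".";
        registry.add(prefix + "calls", 1);
        registry.add(prefix + "latencyMillisTotal", call.durationMillis);
        registry.add(prefix + "retries", call.retryCount);
        if (call.httpStatus == HTTP_SLOW_DOWN) {
            registry.add(prefix + "throttled", 1);
        }
    }
}
```

In the plugin itself, the role of S3MetricsBridge would presumably be played by
an implementation of the SDK's MetricPublisher attached when the S3AsyncClient
is built, and CounterRegistry by the filesystem's MetricGroup. Latency would
more likely be exposed as a histogram than as a running total; the total is
used here only to keep the sketch short.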



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
