[ 
https://issues.apache.org/jira/browse/FLINK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763789#comment-17763789
 ] 

Piotr Nowojski commented on FLINK-23411:
----------------------------------------

Hey, I've just stumbled upon this ticket, as we are also looking into the lack 
of those metrics. Should we really expose point-wise metrics as continuous 
ones? It follows the pattern of {{lastCheckpointDuration}}, but I don't think 
this is a good pattern:
* this is wasteful if {{checkpoint duration >> metrics collecting/reporting 
interval}}
* this is losing data if {{checkpoint duration < metric 
collecting/reporting interval}}
* I don't know how this approach will scale with larger jobs (parallelism > 
20). How could users/DevOps actually visualise and analyse hundreds or 
thousands of metrics of this kind for a single checkpoint? {{parallelism * 
number_of_tasks * 8 (number of metrics) == a lot}} (800 for parallelism 20 and 
5 tasks). 
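
To make the three points above concrete, here is a rough simulation sketch. 
It is not Flink code: the function names are made up for illustration, and 
checkpoints are modelled only by their completion times (in seconds), which 
is a simplification of how a "last checkpoint" gauge would be polled by a 
reporter.

```python
def metric_series_count(parallelism, number_of_tasks, metrics_per_subtask=8):
    """Number of separate time series needed to describe one checkpoint."""
    return parallelism * number_of_tasks * metrics_per_subtask


def sample_last_value_gauge(checkpoint_end_times, report_interval, horizon):
    """Simulate a reporter polling a 'last checkpoint' gauge at a fixed interval.

    checkpoint_end_times: sorted completion times of checkpoints, in seconds.
    Returns (observed, redundant): the indices of the checkpoints the reporter
    ever saw, and how many samples merely repeated an already reported one.
    """
    observed = set()
    redundant = 0
    t = report_interval
    while t <= horizon:
        # The gauge only holds the most recent checkpoint completed by time t.
        completed = [i for i, end in enumerate(checkpoint_end_times) if end <= t]
        if completed:
            latest = completed[-1]
            if latest in observed:
                redundant += 1  # same value reported again: wasted sample
            else:
                observed.add(latest)
        t += report_interval
    return observed, redundant


# Slow checkpoints (one per 60 s), fast reporting (every 10 s): wasteful.
slow_seen, slow_redundant = sample_last_value_gauge([60, 120, 180], 10, 180)

# Fast checkpoints (one per 2 s), slow reporting (every 10 s): data is lost.
fast_ends = list(range(2, 21, 2))
fast_seen, fast_redundant = sample_last_value_gauge(fast_ends, 10, 20)
fast_lost = len(fast_ends) - len(fast_seen)

# Number of series for parallelism 20 and 5 tasks.
series = metric_series_count(20, 5)
```

With these inputs, the slow-checkpoint case reports the same value 10 times 
over, the fast-checkpoint case sees only 2 of the 10 checkpoints, and the 
series count is 800.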

I think in the long term, this should be exposed as something like [distributed 
tracing|https://newrelic.com/blog/how-to-relic/distributed-tracing-anomaly-detection]
 on the user side, and reported by Flink as a bunch of traces via OTEL. Maybe 
we should expand the current {{MetricReporter}} plugins to support that?
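
As a rough illustration of what "a bunch of traces" could look like, here is 
a sketch of a per-checkpoint record that is emitted once when the checkpoint 
completes, instead of being polled. All names here ({{SpanRecord}}, 
{{build_checkpoint_trace}}, the attribute keys) are hypothetical; this is 
neither an existing Flink API nor the OpenTelemetry API, just the shape of 
the data.

```python
from dataclasses import dataclass, field


@dataclass
class SpanRecord:
    """Hypothetical span-like record: one timed event with attributes."""
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def build_checkpoint_trace(checkpoint_id, subtask_timings):
    """Build one root span per checkpoint with one child span per subtask.

    subtask_timings: non-empty list of (task_name, subtask_index, start_ms,
    end_ms) tuples.
    """
    children = [
        SpanRecord(
            name=f"{task}#{index}",
            start_ms=start,
            end_ms=end,
            attributes={"task": task, "subtaskIndex": index},
        )
        for task, index, start, end in subtask_timings
    ]
    return SpanRecord(
        name="checkpoint",
        start_ms=min(child.start_ms for child in children),
        end_ms=max(child.end_ms for child in children),
        attributes={"checkpointId": checkpoint_id},
        children=children,
    )


# One record per checkpoint, produced when it completes (made-up timings).
trace = build_checkpoint_trace(
    42,
    [("source", 0, 1000, 1200), ("sink", 0, 1050, 1900)],
)
slowest = max(trace.children, key=lambda child: child.duration_ms)
```

The point of this shape is that every checkpoint yields exactly one record 
regardless of the reporting interval, and per-subtask details hang off that 
record rather than each becoming its own continuous metric.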

As a stop-gap solution, I would propose FLINK-33071.

> Expose Flink checkpoint details metrics
> ---------------------------------------
>
>                 Key: FLINK-23411
>                 URL: https://issues.apache.org/jira/browse/FLINK-23411
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>    Affects Versions: 1.13.1, 1.12.4
>            Reporter: Jun Qin
>            Assignee: Hangxiang Yu
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>             Fix For: 1.18.0
>
>
> The checkpoint metrics shown in the Flink Web UI, such as the 
> sync/async/alignment/start delay, are not exposed to the metrics system. This 
> makes problem investigation harder when the Web UI is not enabled: those 
> numbers are not available in the DEBUG logs either. I think we should see how 
> we can expose these metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
