[
https://issues.apache.org/jira/browse/FLINK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763789#comment-17763789
]
Piotr Nowojski commented on FLINK-23411:
----------------------------------------
Hey, I've just stumbled upon this ticket as we are also looking into the lack
of those metrics. Should we really expose point-wise metrics as continuous
ones? It's following the pattern of {{lastCheckpointDuration}}, but I don't
think this is a good pattern:
* this is wasteful if {{checkpoint duration >> metrics collecting/reporting
interval}}
* this is losing data if {{checkpoint duration < metric
collecting/reporting interval}}
* I don't know how this approach will scale with larger jobs (parallelism >
20). How could a user or DevOps engineer actually visualise and analyse
hundreds or thousands of metrics of this kind for a single checkpoint?
{{parallelism * number_of_tasks * 8 (number of metrics) == a lot}} (800 for
parallelism 20 and 5 tasks).
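To make the three points above concrete, here is a minimal Java sketch. It is not Flink code: the class and method names are hypothetical, and it approximates "checkpoint duration" by the spacing between checkpoint completions. It computes the number of time series from the formula above and simulates a reporter that samples a last-value gauge at a fixed interval:

```java
final class CheckpointMetricSketch {

    // One time series per subtask per metric (hypothetical helper).
    static long seriesCount(int parallelism, int numberOfTasks, int metricsPerSubtask) {
        return (long) parallelism * numberOfTasks * metricsPerSubtask;
    }

    // Completion timestamps for checkpoints finishing every periodMs up to horizonMs.
    static long[] completionsEvery(long periodMs, long horizonMs) {
        int n = (int) (horizonMs / periodMs);
        long[] times = new long[n];
        for (int i = 0; i < n; i++) {
            times[i] = (i + 1) * periodMs;
        }
        return times;
    }

    // Simulates a reporter that reads a "last checkpoint" gauge every
    // reportIntervalMs and counts how many distinct checkpoints it ever sees.
    static int observedCheckpoints(long[] completionTimesMs, long reportIntervalMs, long horizonMs) {
        int latest = -1;       // index of the most recently completed checkpoint
        int lastReported = -1; // index the reporter saw on its previous read
        int observed = 0;
        for (long t = reportIntervalMs; t <= horizonMs; t += reportIntervalMs) {
            while (latest + 1 < completionTimesMs.length && completionTimesMs[latest + 1] <= t) {
                latest++;
            }
            if (latest != lastReported) {
                observed++;
                lastReported = latest;
            }
        }
        return observed;
    }

    public static void main(String[] args) {
        // Parallelism 20, 5 tasks, 8 metrics each.
        System.out.println("time series: " + seriesCount(20, 5, 8));

        long reportIntervalMs = 10_000;

        // Checkpoints complete faster than the reporter samples: data is lost.
        long[] fast = completionsEvery(2_000, 60_000);
        System.out.println("fast job: observed "
                + observedCheckpoints(fast, reportIntervalMs, 60_000)
                + " of " + fast.length + " checkpoints");

        // Checkpoints complete much more slowly: most reports repeat an old value.
        long[] slow = completionsEvery(60_000, 600_000);
        System.out.println("slow job: " + (600_000 / reportIntervalMs) + " reports for "
                + observedCheckpoints(slow, reportIntervalMs, 600_000)
                + " distinct checkpoints");
    }
}
```

With the parameters in {{main}}, the count is 800 time series, the fast job reports only 6 of its 30 checkpoints, and the slow job sends 60 reports for 10 distinct checkpoints.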
I think in the long term, this should be exposed as something like [distributed
tracing|https://newrelic.com/blog/how-to-relic/distributed-tracing-anomaly-detection]
on the user side, and reported by Flink as a bunch of traces via OTEL. Maybe
we should expand the current {{MetricReporter}} plugins to support that?
As a stopgap solution, I would propose FLINK-33071.
> Expose Flink checkpoint details metrics
> ---------------------------------------
>
> Key: FLINK-23411
> URL: https://issues.apache.org/jira/browse/FLINK-23411
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics
> Affects Versions: 1.13.1, 1.12.4
> Reporter: Jun Qin
> Assignee: Hangxiang Yu
> Priority: Major
> Labels: pull-request-available, stale-assigned
> Fix For: 1.18.0
>
>
> The checkpoint metrics as shown in the Flink Web UI like the
> sync/async/alignment/start delay are not exposed to the metrics system. This
> makes problem investigation harder when the Web UI is not enabled: those
> numbers cannot be found in the DEBUG logs either. I think we should see how
> we can expose these metrics.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)