[ https://issues.apache.org/jira/browse/FLINK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763789#comment-17763789 ]
Piotr Nowojski edited comment on FLINK-23411 at 9/11/23 3:51 PM:
-----------------------------------------------------------------

Hey, I've just stumbled upon this ticket. Should we really expose point-wise metrics as continuous ones? This follows the pattern of {{lastCheckpointDuration}}, but I don't think it is a good pattern:
* It is wasteful if {{checkpoint duration >> metrics collecting/reporting interval}}: we will report the same values many times over.
* It loses data if {{checkpoint duration < metrics collecting/reporting interval}}: only a fraction of a checkpoint's stats will be visible in the metric system.
* I don't know how this approach scales to larger jobs (parallelism > 20). How could users/devops actually visualize and analyze hundreds or thousands of such metrics for a single checkpoint ({{parallelism * number_of_tasks * 8 (number of metrics) == a lot}}; 800 for parallelism 20 and 5 tasks)?

From my experience, when there is an issue with a checkpoint, I need to drill down to subtask-level checkpoint stats to track down the problem, and then often correlate numbers between different subtasks. I just don't see how this can be done with the current metric reporting scheme.

I think in the long term this should be exposed as something like [distributed tracing|https://newrelic.com/blog/how-to-relic/distributed-tracing-anomaly-detection] on the user side, and reported by Flink as a bunch of traces via OTEL. Maybe we should expand the current {{MetricReporter}} plugins to support that (see the sketch at the end of this message)? As a stop-gap solution, I would propose FLINK-33071.

> Expose Flink checkpoint details metrics
> ---------------------------------------
>
>                 Key: FLINK-23411
>                 URL: https://issues.apache.org/jira/browse/FLINK-23411
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>    Affects Versions: 1.13.1, 1.12.4
>            Reporter: Jun Qin
>            Assignee: Hangxiang Yu
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>             Fix For: 1.18.0
>
> The checkpoint metrics shown in the Flink Web UI, such as the sync/async/alignment durations and start delay, are not exposed to the metrics system. This makes problem investigation harder when the Web UI is not enabled: those numbers cannot be found in the DEBUG logs either. I think we should look into how we can expose these metrics.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
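
For illustration of the OTEL tracing idea proposed above, here is a minimal sketch of how a completed checkpoint could be reported as a single trace: one parent span per checkpoint and one child span per subtask, with the sync/async/alignment/start-delay stats attached as attributes. The {{CheckpointTraceReporter}} and {{SubtaskStats}} names are hypothetical (they are not an existing Flink or OpenTelemetry API); the sketch only assumes the standard OpenTelemetry Java API:

{code:java}
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import java.util.concurrent.TimeUnit;

/** Hypothetical reporter: one trace per completed checkpoint, one child span per subtask. */
public class CheckpointTraceReporter {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("flink-checkpointing");

    public void reportCompletedCheckpoint(
            long checkpointId, long startNanos, long endNanos, SubtaskStats[] subtasks) {
        // Parent span covers the whole checkpoint and is emitted exactly once,
        // when the checkpoint completes (point-wise, not polled on an interval).
        Span checkpoint = tracer.spanBuilder("checkpoint")
                .setStartTimestamp(startNanos, TimeUnit.NANOSECONDS)
                .setAttribute("checkpointId", checkpointId)
                .startSpan();

        for (SubtaskStats s : subtasks) {
            // One child span per subtask carries the point-wise stats as attributes,
            // so they can be correlated within a single trace at any parallelism.
            Span subtask = tracer.spanBuilder("subtask")
                    .setParent(Context.current().with(checkpoint))
                    .setStartTimestamp(s.startNanos, TimeUnit.NANOSECONDS)
                    .setAttribute("subtaskIndex", s.subtaskIndex)
                    .setAttribute("startDelayMs", s.startDelayMs)
                    .setAttribute("alignmentDurationMs", s.alignmentDurationMs)
                    .setAttribute("syncDurationMs", s.syncDurationMs)
                    .setAttribute("asyncDurationMs", s.asyncDurationMs)
                    .startSpan();
            subtask.end(s.endNanos, TimeUnit.NANOSECONDS);
        }
        checkpoint.end(endNanos, TimeUnit.NANOSECONDS);
    }

    /** Plain holder for per-subtask checkpoint stats; a stand-in for Flink's internal stats. */
    public static class SubtaskStats {
        public int subtaskIndex;
        public long startNanos;
        public long endNanos;
        public long startDelayMs;
        public long alignmentDurationMs;
        public long syncDurationMs;
        public long asyncDurationMs;
    }
}
{code}

With this scheme each checkpoint is reported exactly once, regardless of how its duration compares to the reporting interval, and all subtask stats stay correlated within one trace at any parallelism.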