[
https://issues.apache.org/jira/browse/FLINK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763789#comment-17763789
]
Piotr Nowojski edited comment on FLINK-23411 at 9/11/23 3:51 PM:
-----------------------------------------------------------------
Hey, I've just stumbled upon this ticket. Should we really expose point-wise
metrics as continuous ones? It follows the pattern of {{lastCheckpointDuration}}
(see the gauge sketch after the list below), but I don't think this is a good pattern:
* This is wasteful if {{checkpoint duration >> metrics collecting/reporting
interval}}. We will report the same values many times over.
* This loses data if {{checkpoint duration < metric collecting/reporting
interval}}. Only a fraction of the checkpoints' stats will be visible in the
metric system.
* I don't know how this approach would scale with larger jobs (parallelism >
20). How could users/dev ops actually visualize and analyze hundreds or thousands
of such metrics for a single checkpoint ({{parallelism * number_of_tasks
* 8 (number of metrics) == a lot}}, i.e. 800 for parallelism 20 and 5 tasks)? In
my experience, if there is an issue with a checkpoint, I need to drill down to
subtask-level checkpoint stats to track down the problem and then often
correlate numbers between different subtasks. I just don't see how this can be
done with the current metric reporting scheme.
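To make the first two points concrete, here is a minimal sketch of the gauge pattern
(in Java; this is not Flink's actual {{CheckpointStatsTracker}} code, and the class and
field names are made up). A point-wise value sits behind a gauge that reporters poll on
a fixed interval, independently of when checkpoints actually complete:
{code:java}
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

/**
 * Hypothetical sketch, not Flink's real implementation: a point-wise checkpoint
 * stat stored in a field and exposed through a polled gauge.
 */
public class CheckpointDurationGaugeSketch {

    // Updated once per completed checkpoint.
    private volatile long lastCheckpointDurationMs = 0L;

    public void register(MetricGroup jobMetricGroup) {
        // Reporters read this gauge once per reporting interval. If checkpoints are
        // rare, the same value is re-reported many times; if several checkpoints
        // complete within one interval, only the last one is ever seen.
        jobMetricGroup.gauge("lastCheckpointDuration", (Gauge<Long>) () -> lastCheckpointDurationMs);
    }

    public void onCheckpointCompleted(long durationMs) {
        lastCheckpointDurationMs = durationMs;
    }
}
{code}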
I think in the long term, this should be exposed as something like [distributed
tracing|https://newrelic.com/blog/how-to-relic/distributed-tracing-anomaly-detection]
on the user side, and reported by Flink as a bunch of traces via OTEL. Maybe
we should expand the current {{MetricReporter}} plugins to support that?
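For illustration only, a hypothetical sketch of what reporting one completed checkpoint
as a single trace span could look like with the OpenTelemetry Java API. The Flink-side
class and method names below are made up, and this is not an existing Flink reporter;
today's {{MetricReporter}} plugins only carry metrics, not traces:
{code:java}
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: one trace span per completed checkpoint. */
public class CheckpointTraceSketch {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("flink-checkpointing");

    public void reportCheckpoint(long checkpointId, long triggerTimestampMs, long completionTimestampMs,
                                 long syncDurationMs, long asyncDurationMs, long alignmentDurationMs) {
        Span span = tracer.spanBuilder("checkpoint")
                .setStartTimestamp(triggerTimestampMs, TimeUnit.MILLISECONDS)
                .startSpan();
        // All point-wise stats of this checkpoint travel as attributes of one span,
        // so nothing is duplicated or dropped by a fixed reporting interval.
        span.setAttribute("checkpointId", checkpointId);
        span.setAttribute("syncDurationMs", syncDurationMs);
        span.setAttribute("asyncDurationMs", asyncDurationMs);
        span.setAttribute("alignmentDurationMs", alignmentDurationMs);
        span.end(completionTimestampMs, TimeUnit.MILLISECONDS);
    }
}
{code}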
As a stopgap solution, I propose FLINK-33071.
was (Author: pnowojski):
Hey, I've just stumbled upon this ticket as we are also looking into the lack of
those metrics. Should we really expose point-wise metrics as continuous ones? It
follows the pattern of {{lastCheckpointDuration}}, but I don't think this is
a good pattern:
* this is wasteful if {{checkpoint duration >> metrics collecting/reporting
interval}}
* this loses data if {{checkpoint duration < metric
collecting/reporting interval}}
* I don't know how this approach would scale with larger jobs (parallelism >
20). How could users/dev ops actually visualise and analyse hundreds or thousands
of such metrics for a single checkpoint ({{parallelism * number_of_tasks
* 8 (number of metrics) == a lot}}, i.e. 800 for parallelism 20 and 5 tasks)?
I think in the long term, this should be exposed as something like [distributed
tracing|https://newrelic.com/blog/how-to-relic/distributed-tracing-anomaly-detection]
on the user side, and reported by Flink as a bunch of traces via OTEL. Maybe
we should expand the current {{MetricReporter}} plugins to support that?
As a stopgap solution, I propose FLINK-33071.
> Expose Flink checkpoint details metrics
> ---------------------------------------
>
> Key: FLINK-23411
> URL: https://issues.apache.org/jira/browse/FLINK-23411
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics
> Affects Versions: 1.13.1, 1.12.4
> Reporter: Jun Qin
> Assignee: Hangxiang Yu
> Priority: Major
> Labels: pull-request-available, stale-assigned
> Fix For: 1.18.0
>
>
> The checkpoint metrics shown in the Flink Web UI, such as the
> sync/async/alignment/start delay, are not exposed to the metrics system. This
> makes problem investigation harder when the Web UI is not enabled: those numbers
> cannot be found in the DEBUG logs either. I think we should see how we can expose
> these metrics.