[ 
https://issues.apache.org/jira/browse/FLINK-23411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763789#comment-17763789
 ] 

Piotr Nowojski edited comment on FLINK-23411 at 9/11/23 3:51 PM:
-----------------------------------------------------------------

Hey, I've just stumbled upon this ticket. Should we really expose point-wise 
metrics as continuous ones? This follows the pattern of 
{{lastCheckpointDuration}}, but I don't think it is a good pattern (see the 
sketch after the list below):
* It is wasteful if {{checkpoint duration >> metrics collecting/reporting 
interval}}: we will report the same values many times over.
* It loses data if {{checkpoint duration < metrics collecting/reporting 
interval}}: only a fraction of the checkpoints' stats will be visible in the 
metric system.
* I don't know how this approach will scale with larger jobs (parallelism > 
20). How could users/devops actually visualize and analyze hundreds or 
thousands of such metrics for a single checkpoint? {{parallelism * 
number_of_tasks * 8 (number of metrics) == a lot}} (800 for parallelism 20 and 
5 tasks). From my experience, if there is an issue with a checkpoint, I need to 
drill down to subtask-level checkpoint stats to track down the problem, and 
then often correlate numbers between different subtasks. I just don't see how 
this can be done with the current metric reporting scheme.
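
To illustrate the first two points, here is a minimal sketch of the 
gauge-based pattern in question (my own illustration on top of Flink's 
{{Gauge}}/{{MetricGroup}} API, not the actual {{lastCheckpointDuration}} 
implementation): the point-wise value sits in a field, and the reporter polls 
it at its own fixed interval, independently of when checkpoints complete.

{code:java}
import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

// Sketch only: a point-wise checkpoint stat exposed as a continuously polled gauge.
class CheckpointDurationGauge {

    private volatile long lastCheckpointDurationMs = -1L;

    void register(MetricGroup group) {
        // The metric reporter reads this gauge at its own reporting interval.
        // If checkpoints take much longer than the interval, the same value is
        // reported many times; if they complete faster, most values are never seen.
        group.gauge("lastCheckpointDuration", (Gauge<Long>) () -> lastCheckpointDurationMs);
    }

    void onCheckpointCompleted(long durationMs) {
        lastCheckpointDurationMs = durationMs;
    }
}
{code}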

I think in the long term, this should be exposed as something like [distributed 
tracing|https://newrelic.com/blog/how-to-relic/distributed-tracing-anomaly-detection]
 on the user side, and reported by Flink as a bunch of traces via OTEL. Maybe 
we should expand the current {{MetricReporter}} plugins to support that?
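
For example, a completed checkpoint could be reported as a single span that 
carries all of its point-wise stats as attributes. A rough sketch using the 
OpenTelemetry Java API (the class, span, and attribute names below are made up 
for illustration; this is not an existing Flink interface):

{code:java}
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import java.util.concurrent.TimeUnit;

// Sketch only: one span per completed checkpoint instead of continuously polled gauges.
class CheckpointTraceReporter {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("flink-checkpointing");

    void reportCompletedCheckpoint(long checkpointId,
                                   long triggerTimestampMs,
                                   long completionTimestampMs,
                                   long syncDurationMs,
                                   long asyncDurationMs,
                                   long alignmentDurationMs) {
        Span span = tracer.spanBuilder("checkpoint")
                .setStartTimestamp(triggerTimestampMs, TimeUnit.MILLISECONDS)
                .startSpan();
        span.setAttribute("checkpointId", checkpointId);
        span.setAttribute("syncDurationMs", syncDurationMs);
        span.setAttribute("asyncDurationMs", asyncDurationMs);
        span.setAttribute("alignmentDurationMs", alignmentDurationMs);
        // The span's start/end timestamps carry the end-to-end checkpoint duration.
        span.end(completionTimestampMs, TimeUnit.MILLISECONDS);
    }
}
{code}

With something like this, all stats of one checkpoint end up in a single 
trace, so they can be correlated across subtasks by the tracing backend 
instead of by hand.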

As a stop-gap solution, I would propose FLINK-33071.


was (Author: pnowojski):
Hey, I've just stumbled upon this ticket as we are also looking into the lack 
of those metrics. Should we really expose point-wise metrics as continuous 
ones? This follows the pattern of {{lastCheckpointDuration}}, but I don't 
think it is a good pattern:
* this is wasteful if {{checkpoint duration >> metrics collecting/reporting 
interval}}
* this loses data if {{checkpoint duration < metrics collecting/reporting 
interval}}
* I don't know how this approach will scale with larger jobs (parallelism > 
20). How could users/devops actually visualise and analyse hundreds or 
thousands of such metrics for a single checkpoint? {{parallelism * 
number_of_tasks * 8 (number of metrics) == a lot}} (800 for parallelism 20 and 
5 tasks). 

I think in the long term, this should be exposed as something like [distributed 
tracing|https://newrelic.com/blog/how-to-relic/distributed-tracing-anomaly-detection]
 on the user side, and reported by Flink as a bunch of traces via OTEL. Maybe 
we should expand the current {{MetricReporter}} plugins to support that?

As a stop-gap solution, I would propose FLINK-33071.

> Expose Flink checkpoint details metrics
> ---------------------------------------
>
>                 Key: FLINK-23411
>                 URL: https://issues.apache.org/jira/browse/FLINK-23411
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>    Affects Versions: 1.13.1, 1.12.4
>            Reporter: Jun Qin
>            Assignee: Hangxiang Yu
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>             Fix For: 1.18.0
>
>
> The checkpoint metrics shown in the Flink Web UI, like the 
> sync/async/alignment/start delay durations, are not exposed to the metrics 
> system. This makes problem investigation harder when the Web UI is not 
> enabled: those numbers cannot be found in the DEBUG logs either. I think we 
> should see how we can expose these metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
