[ https://issues.apache.org/jira/browse/FLINK-22664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359909#comment-17359909 ]
Guokuai Huang commented on FLINK-22664:
---------------------------------------
[~trohrmann] Yes, sorry for the late reply. After I found the problem, I
explained it and closed the issue.
> Task metrics are not properly unregistered during region failover
> -----------------------------------------------------------------
>
> Key: FLINK-22664
> URL: https://issues.apache.org/jira/browse/FLINK-22664
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Metrics
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Guokuai Huang
> Priority: Major
> Attachments: Screen Shot 2021-05-14 at 2.51.04 PM.png, Screen Shot
> 2021-05-14 at 5.40.22 PM.png
>
>
> In the current implementation of AbstractPrometheusReporter, metrics with the
> same scopedMetricName share the same metric Collector. A HashMap named
> collectorsWithCountByMetricName records the reference count of each
> Collector. A Collector is unregistered only when its reference count drops
> to 0.
> Suppose we have a Flink job with a single chained operator, and *the execution
> failover-strategy is set to region.*
> !Screen Shot 2021-05-14 at 2.51.04 PM.png!
> The following figure compares the number of metrics after region failover
> when this job runs on 2 TaskManagers with 1 slot/TM and on 1 TaskManager
> with 2 slots/TM.
> Each inflection point on the graph represents a region failover. *For a
> TaskManager with multiple tasks (slots), the number of metrics increases after
> each region failover.*
> This is a case I deliberately constructed to illustrate the problem.
> The TaskManager only needs to restart some of its tasks during each region
> failover, that is to say, *the reference counter of a task's metric Collector
> never becomes 0, so the metric Collector is never unregistered.*
> This problem puts a lot of pressure on our Prometheus instance; please see
> if there is a good solution.
> !Screen Shot 2021-05-14 at 5.40.22 PM.png!
>
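The reference counting described in the quoted issue can be illustrated with
a minimal sketch. It is a simplified model, not Flink's actual
AbstractPrometheusReporter code: the class CollectorRefCountSketch, its
methods, and the metric name used in main are hypothetical and exist only for
illustration. It shows that when two tasks on one TaskManager share a metric
name and only one of them restarts, the count never reaches 0, so the shared
Collector stays registered.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of per-name Collector reference counting. */
public class CollectorRefCountSketch {

    // Metric name -> number of registered metrics currently sharing one Collector.
    private final Map<String, Integer> countByMetricName = new HashMap<>();

    /** A task registers a metric: create the shared entry or bump its count. */
    public void register(String metricName) {
        countByMetricName.merge(metricName, 1, Integer::sum);
    }

    /** A task removes a metric: the entry is dropped only when the count reaches 0. */
    public void unregister(String metricName) {
        countByMetricName.computeIfPresent(
                metricName, (name, count) -> count == 1 ? null : count - 1);
    }

    /** True while at least one metric still references the shared Collector. */
    public boolean isCollectorRegistered(String metricName) {
        return countByMetricName.containsKey(metricName);
    }

    public static void main(String[] args) {
        // Illustrative metric name, assumed for this sketch.
        String name = "flink_taskmanager_job_task_numRecordsIn";
        CollectorRefCountSketch reporter = new CollectorRefCountSketch();

        reporter.register(name); // task in slot 1
        reporter.register(name); // task in slot 2

        // Region failover restarts only the task in slot 1.
        reporter.unregister(name);
        System.out.println("registered after partial restart: "
                + reporter.isCollectorRegistered(name));
        reporter.register(name);

        // Only when every task sharing the name is removed is the entry dropped.
        reporter.unregister(name);
        reporter.unregister(name);
        System.out.println("registered after full shutdown: "
                + reporter.isCollectorRegistered(name));
    }
}
```

With 1 slot per TaskManager, the single task's unregister call takes the count
from 1 to 0, which is why the two deployments in the issue behave differently.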