Guokuai Huang created FLINK-22664:
-------------------------------------
Summary: Task metrics are not properly unregistered during region
failover
Key: FLINK-22664
URL: https://issues.apache.org/jira/browse/FLINK-22664
Project: Flink
Issue Type: Bug
Components: Runtime / Metrics
Affects Versions: 1.12.0, 1.11.0
Reporter: Guokuai Huang
Attachments: Screen Shot 2021-05-14 at 2.36.30 PM.png, Screen Shot
2021-05-14 at 2.51.04 PM.png
In the current implementation of AbstractPrometheusReporter, metrics with the
same scopedMetricName share the same metric Collector. A HashMap named
collectorsWithCountByMetricName tracks the reference count of each Collector,
and a Collector is unregistered only when its reference count drops to 0.
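The reference counting described above can be sketched roughly as follows. This
is a simplified, hypothetical model and not Flink's actual code: the class
RefCountedCollectors, its methods, and the metric name used in main are
assumptions made for illustration, and the real reporter stores Prometheus
Collector objects rather than bare counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of per-name reference counting.
// It tracks only the counts, not real Prometheus collectors.
class RefCountedCollectors {
    // scoped metric name -> number of metrics currently sharing that collector
    private final Map<String, Integer> countByMetricName = new HashMap<>();

    /** Adds one metric; returns true if this call registered a new collector. */
    synchronized boolean add(String scopedMetricName) {
        int newCount = countByMetricName.merge(scopedMetricName, 1, Integer::sum);
        return newCount == 1;
    }

    /** Removes one metric; returns true if the collector was unregistered. */
    synchronized boolean remove(String scopedMetricName) {
        Integer count = countByMetricName.get(scopedMetricName);
        if (count == null) {
            throw new IllegalStateException("No collector for " + scopedMetricName);
        }
        if (count == 1) {
            // Last reference gone: this is the only point where we unregister.
            countByMetricName.remove(scopedMetricName);
            return true;
        }
        countByMetricName.put(scopedMetricName, count - 1);
        return false;
    }

    synchronized int registeredCollectors() {
        return countByMetricName.size();
    }

    public static void main(String[] args) {
        RefCountedCollectors reporter = new RefCountedCollectors();
        String name = "flink_taskmanager_job_task_numRecordsIn"; // illustrative name
        reporter.add(name);
        reporter.add(name);
        System.out.println(reporter.remove(name)); // false: one reference remains
        System.out.println(reporter.remove(name)); // true: count reached 0
    }
}
```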
Suppose we have a Flink job with a single chained operator, and the execution
failover-strategy is set to region.
!Screen Shot 2021-05-14 at 2.51.04 PM.png!
The following figure compares the growth in the number of metrics when this job
runs on 2 TaskManagers with 1 slot/TM versus 1 TaskManager with 2 slots/TM.
!Screen Shot 2021-05-14 at 2.36.30 PM.png!
Each inflection point on the graph represents a region failover of the job. For
a TaskManager with multiple tasks, the number of metrics increases after each
region restart.
This is a case I deliberately constructed to illustrate the problem. The
TaskManager only needs to restart some of its tasks during each region
failover, so the reference count of a task's metric Collector never drops to 0
and the Collector is never unregistered.
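A toy simulation of the two deployments compared above may make the difference
concrete. It is a sketch under stated assumptions, not a reproduction of Flink
internals: ToyReporter, RegionFailoverDemo, the metric name, and the attempt
labels are all hypothetical. Following the description in this issue, the model
also assumes that series belonging to old task attempts go away only when the
shared Collector is unregistered.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the reporter running inside one TaskManager.
class ToyReporter {
    private final Map<String, Integer> refCount = new HashMap<>();
    private final Map<String, Set<String>> seriesByName = new HashMap<>();

    void add(String name, String attemptLabel) {
        refCount.merge(name, 1, Integer::sum);
        seriesByName.computeIfAbsent(name, k -> new HashSet<>()).add(attemptLabel);
    }

    void remove(String name) {
        Integer count = refCount.get(name);
        if (count == null) {
            throw new IllegalStateException("No collector for " + name);
        }
        if (count == 1) {
            refCount.remove(name);
            // Modeling assumption: series are dropped only on unregistration.
            seriesByName.remove(name);
        } else {
            refCount.put(name, count - 1);
        }
    }

    int exportedSeries() {
        return seriesByName.values().stream().mapToInt(Set::size).sum();
    }
}

class RegionFailoverDemo {
    static final String NAME = "task_metric"; // illustrative metric name

    /** Total series exported by all TaskManagers after the given failovers. */
    static int simulate(int taskManagers, int slotsPerTm, int failovers) {
        ToyReporter[] reporters = new ToyReporter[taskManagers];
        for (int tm = 0; tm < taskManagers; tm++) {
            reporters[tm] = new ToyReporter();
            for (int slot = 0; slot < slotsPerTm; slot++) {
                reporters[tm].add(NAME, "tm" + tm + "-slot" + slot + "-attempt0");
            }
        }
        // Each region failover restarts only the task in slot 0 of TaskManager 0.
        for (int attempt = 1; attempt <= failovers; attempt++) {
            reporters[0].remove(NAME);
            reporters[0].add(NAME, "tm0-slot0-attempt" + attempt);
        }
        int total = 0;
        for (ToyReporter reporter : reporters) {
            total += reporter.exportedSeries();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(simulate(2, 1, 3)); // 2: count hits 0 on every restart
        System.out.println(simulate(1, 2, 3)); // 5: one stale series per restart
    }
}
```

In this model the 2 x 1 layout stays flat because the restarted task holds the
only reference on its TaskManager, while the 1 x 2 layout grows by one series
per failover because the second task keeps the count above 0.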