[ 
https://issues.apache.org/jira/browse/FLINK-22664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guokuai Huang updated FLINK-22664:
----------------------------------
    Description: 
In the current implementation of AbstractPrometheusReporter, metrics with the 
same scopedMetricName share the same metric Collector. At the same time, a 
HashMap named collectorsWithCountByMetricName is maintained to record the 
reference count of each Collector. A Collector is unregistered only when its 
reference count drops to 0. 

Suppose we have a Flink job with a single chained operator, and the execution 
failover-strategy is set to region.
 !Screen Shot 2021-05-14 at 2.51.04 PM.png!
 The following figure compares the growth of the number of metrics when this 
job runs on 2 TaskManagers with 1 slot/TM and on 1 TaskManager with 2 slots/TM.

Each inflection point on the graph represents a region failover of the job. For 
a TaskManager with multiple tasks, the number of metrics increases after each 
region restart.

This is a case I deliberately constructed to illustrate the problem. The 
TaskManager only needs to restart some of its tasks during each region 
failover, so the reference count of the tasks' metric Collector never 
reaches 0 and the Collector is never unregistered.
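
The effect described above can be reproduced with a small stand-alone model. 
This is only a sketch and not the actual AbstractPrometheusReporter code: the 
class RefCountedCollectors, its methods, the metric name and the label strings 
are all hypothetical. It also assumes that each restarted task attempt 
registers under a new label set, while the label set of the old attempt stays 
in the shared Collector until the whole Collector is unregistered.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of reference-counted collectors (not Flink code). */
class RefCountedCollectors {

    /** Stand-in for a shared Collector: one exposed series per label set. */
    static final class Collector {
        final Set<String> labelSets = new HashSet<>();
        int refCount;
    }

    private final Map<String, Collector> collectorsByMetricName = new HashMap<>();

    void addMetric(String metricName, String labels) {
        Collector c = collectorsByMetricName.computeIfAbsent(metricName, k -> new Collector());
        c.labelSets.add(labels);
        c.refCount++;
    }

    void removeMetric(String metricName) {
        Collector c = collectorsByMetricName.get(metricName);
        if (c == null) {
            return;
        }
        // Assumption: only the counter is decremented; stale label sets are
        // dropped only when the whole collector is unregistered at count 0.
        if (--c.refCount == 0) {
            collectorsByMetricName.remove(metricName);
        }
    }

    int exposedSeries(String metricName) {
        Collector c = collectorsByMetricName.get(metricName);
        return c == null ? 0 : c.labelSets.size();
    }

    /** Restarts only subtask 0 on each failover, as in a partial region failover. */
    static int simulate(int tasksOnTaskManager, int failovers) {
        RefCountedCollectors reporter = new RefCountedCollectors();
        String name = "numRecordsIn"; // hypothetical metric name
        for (int subtask = 0; subtask < tasksOnTaskManager; subtask++) {
            reporter.addMetric(name, "subtask=" + subtask + ",attempt=0");
        }
        for (int attempt = 1; attempt <= failovers; attempt++) {
            reporter.removeMetric(name);
            reporter.addMetric(name, "subtask=0,attempt=" + attempt);
        }
        return reporter.exposedSeries(name);
    }

    public static void main(String[] args) {
        System.out.println("1 task/TM after 3 failovers:  " + simulate(1, 3));
        System.out.println("2 tasks/TM after 3 failovers: " + simulate(2, 3));
    }
}
```

Under these assumptions, a TaskManager running a single task stays at 1 series 
across 3 failovers, while a TaskManager running 2 tasks grows from 2 series to 
5, because the reference count never drops below 1.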

This problem has put a lot of pressure on our Prometheus. Please see if there 
is a good solution.

!Screen Shot 2021-05-14 at 2.36.30 PM.png!

 

  was:
In the current implementation of AbstractPrometheusReporter, metrics with the 
same scopedMetricName share the same metric Collector. At the same time, a 
HashMap named collectorsWithCountByMetricName is maintained to record the 
reference count of each Collector. A Collector is unregistered only when its 
reference count drops to 0. 

Suppose we have a Flink job with a single chained operator, and the execution 
failover-strategy is set to region.
!Screen Shot 2021-05-14 at 2.51.04 PM.png!
The following figure compares the growth of the number of metrics when this job 
runs on 2 TaskManagers with 1 slot/TM and on 1 TaskManager with 2 slots/TM.


!Screen Shot 2021-05-14 at 2.36.30 PM.png!

Each inflection point on the graph represents a region failover of the job. For 
a TaskManager with multiple tasks, the number of metrics increases after each 
region restart.

This is a case I deliberately constructed to illustrate the problem. The 
TaskManager only needs to restart some of its tasks during each region 
failover, so the reference count of the tasks' metric Collector never 
reaches 0 and the Collector is never unregistered.


> Task metrics are not properly unregistered during region failover
> -----------------------------------------------------------------
>
>                 Key: FLINK-22664
>                 URL: https://issues.apache.org/jira/browse/FLINK-22664
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Guokuai Huang
>            Priority: Major
>         Attachments: Screen Shot 2021-05-14 at 2.36.30 PM.png, Screen Shot 
> 2021-05-14 at 2.51.04 PM.png
>
>
> In the current implementation of AbstractPrometheusReporter, metrics with the 
> same scopedMetricName share the same metric Collector. At the same time, a 
> HashMap named collectorsWithCountByMetricName is maintained to record the 
> reference count of each Collector. A Collector is unregistered only when its 
> reference count drops to 0. 
> Suppose we have a Flink job with a single chained operator, and the execution 
> failover-strategy is set to region.
>  !Screen Shot 2021-05-14 at 2.51.04 PM.png!
>  The following figure compares the growth of the number of metrics when this 
> job runs on 2 TaskManagers with 1 slot/TM and on 1 TaskManager with 2 
> slots/TM.
> Each inflection point on the graph represents a region failover of the job. 
> For a TaskManager with multiple tasks, the number of metrics increases after 
> each region restart.
> This is a case I deliberately constructed to illustrate the problem. The 
> TaskManager only needs to restart some of its tasks during each region 
> failover, so the reference count of the tasks' metric Collector never 
> reaches 0 and the Collector is never unregistered.
> This problem has put a lot of pressure on our Prometheus. Please see if there 
> is a good solution.
> !Screen Shot 2021-05-14 at 2.36.30 PM.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
