[ 
https://issues.apache.org/jira/browse/FLINK-22664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guokuai Huang updated FLINK-22664:
----------------------------------
    Description: 
In the current implementation of AbstractPrometheusReporter, metrics with the 
same scopedMetricName share a single metric Collector. A HashMap named 
collectorsWithCountByMetricName records the reference count of each Collector, 
and a Collector is unregistered only when its reference count drops to 0. 
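
As a minimal sketch of the reference counting described above (a simplified model, not the actual AbstractPrometheusReporter code): the class name RefCountedCollectors and its methods are hypothetical, and a plain Object stands in for the Prometheus Collector. Only the map name is taken from the description.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical model of reference-counted collectors. */
final class RefCountedCollectors {
    // scoped metric name -> (shared collector, number of metrics using it)
    private final Map<String, Map.Entry<Object, Integer>> collectorsWithCountByMetricName =
            new HashMap<>();

    /** Adds one user of the collector for this name, creating it on first use. */
    synchronized void add(String scopedMetricName) {
        Map.Entry<Object, Integer> entry = collectorsWithCountByMetricName.get(scopedMetricName);
        Object collector = (entry == null) ? new Object() : entry.getKey();
        int count = (entry == null) ? 0 : entry.getValue();
        collectorsWithCountByMetricName.put(
                scopedMetricName, new AbstractMap.SimpleImmutableEntry<>(collector, count + 1));
    }

    /** Drops one user; returns true only if the collector was unregistered. */
    synchronized boolean remove(String scopedMetricName) {
        Map.Entry<Object, Integer> entry = collectorsWithCountByMetricName.get(scopedMetricName);
        if (entry == null) {
            return false; // nothing registered under this name
        }
        if (entry.getValue() == 1) {
            // Last user is gone: this is the only point where the collector goes away.
            collectorsWithCountByMetricName.remove(scopedMetricName);
            return true;
        }
        collectorsWithCountByMetricName.put(
                scopedMetricName,
                new AbstractMap.SimpleImmutableEntry<>(entry.getKey(), entry.getValue() - 1));
        return false;
    }

    synchronized boolean isRegistered(String scopedMetricName) {
        return collectorsWithCountByMetricName.containsKey(scopedMetricName);
    }

    public static void main(String[] args) {
        RefCountedCollectors collectors = new RefCountedCollectors();
        collectors.add("numRecordsIn"); // task A
        collectors.add("numRecordsIn"); // task B, same scoped name
        System.out.println(collectors.remove("numRecordsIn")); // false: one user left
        System.out.println(collectors.remove("numRecordsIn")); // true: count reached 0
    }
}
```

In this model a collector survives for as long as at least one metric with the same scoped name is still registered.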

Suppose we have a Flink job with a single chained operator, and *the execution 
failover-strategy is set to region.*
 !Screen Shot 2021-05-14 at 2.51.04 PM.png!
 The following figure compares the growth in the number of metrics when this 
job runs on 2 TaskManagers with 1 slot each and on 1 TaskManager with 2 slots.

Each inflection point on the graph represents a region failover of the job. 
*For a TaskManager running multiple tasks, the number of metrics increases after 
each region failover.*

This is a case I deliberately constructed to illustrate the problem. 
The TaskManager only needs to restart some of its tasks during each region 
failover, so *the reference count of the task's metric Collector never reaches 
0 and the Collector is never unregistered.*
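
The reasoning above can be checked with a small simulation. It is only a sketch under stated assumptions: each region failover restarts exactly one task on the TaskManager, the failed task's metric is removed before the restarted task registers again, and the class and method names are hypothetical.

```java
/** Hypothetical simulation of region failovers on a single TaskManager. */
final class RegionFailoverSketch {

    /**
     * Returns how often the shared collector's reference count reached 0
     * (the only moment it would be unregistered) over the given failovers.
     */
    static int unregistrations(int tasksOnTaskManager, int failovers) {
        int refCount = tasksOnTaskManager; // each running task registered the metric once
        int unregistered = 0;
        for (int i = 0; i < failovers; i++) {
            refCount--; // the failed task's metric is removed
            if (refCount == 0) {
                unregistered++; // collector would be unregistered here
            }
            refCount++; // the restarted task registers the metric again
        }
        return unregistered;
    }

    public static void main(String[] args) {
        System.out.println("1 slot/TM:  " + unregistrations(1, 3)); // prints "1 slot/TM:  3"
        System.out.println("2 slots/TM: " + unregistrations(2, 3)); // prints "2 slots/TM: 0"
    }
}
```

With one task per TaskManager the count reaches 0 on every failover, so the Collector is unregistered and recreated. With two tasks the count only moves between 1 and 2.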

This problem puts heavy pressure on our Prometheus instance. Please see if 
there is a good solution.

!Screen Shot 2021-05-14 at 5.40.22 PM.png!

 


> Task metrics are not properly unregistered during region failover
> -----------------------------------------------------------------
>
>                 Key: FLINK-22664
>                 URL: https://issues.apache.org/jira/browse/FLINK-22664
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Guokuai Huang
>            Priority: Major
>         Attachments: Screen Shot 2021-05-14 at 2.51.04 PM.png, Screen Shot 
> 2021-05-14 at 5.40.22 PM.png


