[jira] [Commented] (FLINK-22664) Task metrics are not properly unregistered during region failover

Guokuai Huang (Jira) Sun, 16 May 2021 21:09:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-22664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345860#comment-17345860
 ]


Guokuai Huang commented on FLINK-22664:
---------------------------------------

Sorry, this problem was caused by our own customization of Flink. To make it 
easier to manage the metrics of Flink jobs running on YARN, we modified 
AbstractPrometheusReporter to add the YARN application_id as a metric 
dimension. When a metric was removed, this dimension was not added, so the 
removal did not succeed.
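
A minimal sketch of this kind of mismatch, using a simplified stand-in for a 
Prometheus collector rather than the real client API. The class, the label 
values and the application id are hypothetical and only illustrate the failure 
mode:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LabelMismatchSketch {

    /** Simplified stand-in for a collector: one child per list of label values. */
    static final class Collector {
        private final Map<List<String>, Double> children = new HashMap<>();

        void setChild(List<String> labelValues, double value) {
            children.put(labelValues, value);
        }

        void remove(List<String> labelValues) {
            // Only removes a child whose label values match exactly.
            children.remove(labelValues);
        }

        int childCount() {
            return children.size();
        }
    }

    /** Appends the extra dimension (hypothetical application id) to the labels. */
    static List<String> withAppId(List<String> base, String appId) {
        List<String> out = new ArrayList<>(base);
        out.add(appId);
        return out;
    }

    /** Returns how many children are left after one add and one remove. */
    static int leakedChildren(boolean addAppIdOnRemove) {
        Collector collector = new Collector();
        List<String> base = Arrays.asList("taskmanager-1", "task-0");
        String appId = "application_0000_0001"; // hypothetical value

        // Registration path: the extra dimension is added.
        collector.setChild(withAppId(base, appId), 1.0);

        // Removal path: the bug is forgetting the extra dimension here.
        List<String> removeKey = addAppIdOnRemove ? withAppId(base, appId) : base;
        collector.remove(removeKey);
        return collector.childCount();
    }

    public static void main(String[] args) {
        System.out.println("removal without app id leaves: " + leakedChildren(false));
        System.out.println("removal with app id leaves: " + leakedChildren(true));
    }
}
```

With the dimension missing on removal, the lookup key never matches the 
registered child, so the stale series stays behind after every failover.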

> Task metrics are not properly unregistered during region failover
> -----------------------------------------------------------------
>
>                 Key: FLINK-22664
>                 URL: https://issues.apache.org/jira/browse/FLINK-22664
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Metrics
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Guokuai Huang
>            Priority: Major
>         Attachments: Screen Shot 2021-05-14 at 2.51.04 PM.png, Screen Shot 
> 2021-05-14 at 5.40.22 PM.png
>
>
> In the current implementation of AbstractPrometheusReporter, metrics with the 
> same scopedMetricName share the same metric Collector. At the same time, a 
> HashMap named collectorsWithCountByMetricName is maintained to record the 
> reference counter of each Collector. A Collector is unregistered only when 
> its reference counter reaches 0. 
> Suppose we have a Flink job with a single chained operator, and *the 
> execution failover-strategy is set to region.*
>  !Screen Shot 2021-05-14 at 2.51.04 PM.png!
>  The following figure compares the number of metrics when this job runs on 2 
> TaskManagers with 1 slot/TM and on 1 TaskManager with 2 slots/TM after region 
> failover.
> Each inflection point on the graph represents a region failover. *For a 
> TaskManager with multiple tasks (slots), the number of metrics increases 
> after each region failover.*
> This is a case I deliberately constructed to illustrate the problem. 
> The TaskManager only needs to restart some of its tasks during each region 
> failover, that is to say, *the reference counter of the task's metric 
> Collector will never reach 0, so the metric Collector will not be 
> unregistered.*
> This problem has put a lot of pressure on our Prometheus; please see if 
> there is a good solution.
> !Screen Shot 2021-05-14 at 5.40.22 PM.png!
>  
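
For reference, the counting scheme described in the issue can be sketched as 
follows. This is a simplified illustration under assumptions, not the actual 
AbstractPrometheusReporter code; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class CollectorRefCountSketch {

    // scopedMetricName -> number of registered metrics sharing one collector
    private final Map<String, Integer> countByMetricName = new HashMap<>();

    /** A metric with this name was added: bump the shared collector's count. */
    void register(String scopedMetricName) {
        countByMetricName.merge(scopedMetricName, 1, Integer::sum);
    }

    /**
     * A metric with this name was removed.
     * Returns true only when the last reference is gone, i.e. when the
     * shared collector itself would be unregistered.
     */
    boolean unregister(String scopedMetricName) {
        Integer count = countByMetricName.get(scopedMetricName);
        if (count == null) {
            return false; // unknown name, nothing to do
        }
        if (count == 1) {
            countByMetricName.remove(scopedMetricName);
            return true;
        }
        countByMetricName.put(scopedMetricName, count - 1);
        return false;
    }

    /** Two tasks share a name; only one of them is restarted. */
    static boolean collectorDroppedAfterPartialRestart() {
        CollectorRefCountSketch reporter = new CollectorRefCountSketch();
        reporter.register("flink_task_metric"); // task in slot 1
        reporter.register("flink_task_metric"); // task in slot 2
        return reporter.unregister("flink_task_metric"); // slot 1 fails over
    }

    /** Two tasks share a name; both are removed. */
    static boolean collectorDroppedAfterFullRestart() {
        CollectorRefCountSketch reporter = new CollectorRefCountSketch();
        reporter.register("flink_task_metric");
        reporter.register("flink_task_metric");
        reporter.unregister("flink_task_metric");
        return reporter.unregister("flink_task_metric");
    }

    public static void main(String[] args) {
        System.out.println(collectorDroppedAfterPartialRestart());
        System.out.println(collectorDroppedAfterFullRestart());
    }
}
```

A shared collector staying registered after a partial restart is expected in 
this scheme. Whether stale series remain depends on each removed metric's own 
label values being dropped, which is where the mismatch described in the 
comment above applies.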



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
