[ 
https://issues.apache.org/jira/browse/HADOOP-17804?focusedWorklogId=645280&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-645280
 ]

ASF GitHub Bot logged work on HADOOP-17804:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Sep/21 14:04
            Start Date: 01/Sep/21 14:04
    Worklog Time Spent: 10m 
      Work Description: Kimahriman opened a new pull request #3369:
URL: https://github.com/apache/hadoop/pull/3369


   
   ### Description of PR
   Fixes a bug in the Prometheus metrics sink where metrics were deduplicated by 
their name alone, without taking the tag values into account. Prometheus metrics 
are uniquely identified by their name and labels, so several metrics were simply 
being dropped. For example, RPC metrics only included one of the servers/ports 
per metric type, and YARN queue metrics only included metrics for a single queue. 
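
   To illustrate the idea (a hedged sketch, not the actual patch: the class name, 
the nested-map layout, and the label-key construction are made up for 
illustration), the fix amounts to keying cached samples by metric name *and* 
label values rather than by name alone:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.commons.configuration2.SubsetConfiguration;
import org.apache.hadoop.metrics2.AbstractMetric;
import org.apache.hadoop.metrics2.MetricType;
import org.apache.hadoop.metrics2.MetricsRecord;
import org.apache.hadoop.metrics2.MetricsSink;
import org.apache.hadoop.metrics2.MetricsTag;

/** Sketch only: dedup by metric name AND label values, not name alone. */
public class LabelAwareSinkSketch implements MetricsSink {

  // metric name -> (label key -> metric), so series that differ only in
  // their tags no longer overwrite each other
  private final Map<String, Map<String, AbstractMetric>> promMetrics =
      new ConcurrentHashMap<>();

  @Override
  public synchronized void putMetrics(MetricsRecord record) {
    for (AbstractMetric metric : record.metrics()) {
      if (metric.type() == MetricType.COUNTER
          || metric.type() == MetricType.GAUGE) {
        // Build a stable key from the record's tags (illustrative only; the
        // real sink derives the Prometheus name from record + metric name).
        StringBuilder labels = new StringBuilder();
        for (MetricsTag tag : record.tags()) {
          labels.append(tag.name()).append('=').append(tag.value()).append(',');
        }
        promMetrics
            .computeIfAbsent(metric.name(), k -> new HashMap<>())
            .put(labels.toString(), metric);
      }
    }
  }

  @Override
  public void flush() {
    // see the flush-based reset sketch below
  }

  @Override
  public void init(SubsetConfiguration conf) {
  }
}
```

   With the label values part of the key, each queue or server/port combination 
keeps its own entry instead of being overwritten by the last record pushed.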
   
   Additionally, because of the "push" nature of Hadoop metrics, this could end 
up creating a lot of extra series where the tag values change over time even 
though the metric still means the same thing. For example, the `hastate` tag of 
the NameNode metrics can change, but you really only want the most recent value. 
To address this, metrics are now only exposed after a `flush` call, and the 
cache starts fresh after each `flush`. This prevents stale metrics from hanging 
around and being exposed until the service is restarted.
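
   The flush behavior can be pictured as a swap between two maps. Here is a 
minimal, hedged sketch of that idea (all class, field, and method names are 
illustrative, not the real sink), with the publish and flush paths synchronized 
as noted below:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only (names are illustrative): accumulate pushed metrics into a
 * "next" buffer and swap it into the published view on flush(), so a tag
 * combination that stops being reported disappears after the next flush.
 */
public class FlushSwapSketch {

  // What the /prom endpoint renders on the next scrape.
  private volatile Map<String, String> published = new HashMap<>();

  // What putSample() has accumulated since the last flush.
  private Map<String, String> nextBatch = new HashMap<>();

  public synchronized void putSample(String key, String renderedSample) {
    nextBatch.put(key, renderedSample);
  }

  public synchronized void flush() {
    published = nextBatch;       // expose only what this reporting cycle pushed
    nextBatch = new HashMap<>(); // start fresh; stale series are dropped
  }

  /** Read-only view used when serving a scrape. */
  public Map<String, String> snapshotForScrape() {
    return published;
  }
}
```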
   
   There are still some "bad" tags exposed that can lead to multiple Prometheus 
series being created when they really represent the same thing. However, these 
can be dealt with on the Prometheus side by ignoring certain labels, rather than 
trying to hard-code all the bad tags on the Hadoop side.
   
   I don't _think_ there should be any threading/race conditions with 
publishing metrics, since the publish metrics methods are synchronized.
   
   Also adds the `# HELP` line to the Prometheus output.
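
   For reference, the text exposition then looks roughly like this; the metric 
name, description, and label values below are made up for illustration:

```
# HELP rpc_queue_time_num_ops Number of RPC queue-time operations (illustrative)
# TYPE rpc_queue_time_num_ops counter
rpc_queue_time_num_ops{port="8020",servername="namenode"} 1234
rpc_queue_time_num_ops{port="8022",servername="namenode"} 567
```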
   
   ### How was this patch tested?
   New unit tests.
   
   ### For code changes:
   
   - [X] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] ~~Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?~~
   - [ ] ~~If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?~~
   - [ ] ~~If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?~~
   
   



Issue Time Tracking
-------------------

            Worklog Id:     (was: 645280)
    Remaining Estimate: 0h
            Time Spent: 10m

> Prometheus metrics only include the last set of labels
> ------------------------------------------------------
>
>                 Key: HADOOP-17804
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17804
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 3.3.1
>            Reporter: Adam Binford
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> A prometheus endpoint was added in 
> https://issues.apache.org/jira/browse/HADOOP-16398, but the logic that puts 
> them into a map based on the "key" incorrectly hides any metrics with the 
> same key but different labels. The relevant code is here: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/metrics2/sink/PrometheusMetricsSink.java#L55
> The labels/tags need to be taken into account, as different tags mean 
> different metrics. For example, I came across this while trying to scrape 
> metrics for all the queues in our scheduler. Only the last queue is included 
> because all the metrics have the same "key" but a different "queue" label/tag.


