[ https://issues.apache.org/jira/browse/HADOOP-17893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Max Xie updated HADOOP-17893:
------------------------------
Description:
HADOOP-16398 added an exporter for Hadoop metrics to Prometheus, but some metrics cannot be exported correctly. For example:
1. Queue metrics for the ResourceManager
{code:java}
queue_metrics_max_capacity{queue="root.queue1",context="yarn",hostname="rm_host1"} 1
// queue2's metric can't be exported
queue_metrics_max_capacity{queue="root.queue2",context="yarn",hostname="rm_host1"} 2
{code}
Only one queue's metric is ever exported, because PrometheusMetricsSink$metricLines caches a single entry per metric name, even when the metrics carry different tags.
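A minimal sketch of the idea behind a fix: key the cached lines by metric name plus tag set instead of name alone, so metrics that share a name but differ in tags both survive. The class and method names below are illustrative assumptions, not the actual PrometheusMetricsSink code.
{code:java}
// Illustrative only: keep one cached line per (metric name, tag set), so that
// queue1 and queue2, which share a metric name but differ in tags, are both retained.
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class TaggedMetricCache {
  private final Map<String, String> metricLines = new ConcurrentHashMap<>();

  public void put(String metricName, Map<String, String> tags, Number value) {
    // Sort the tags so the cache key is deterministic for a given metric instance.
    String tagPart = new TreeMap<>(tags).entrySet().stream()
        .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
        .collect(Collectors.joining(","));
    String key = metricName + "{" + tagPart + "}";
    metricLines.put(key, key + " " + value);
  }

  public Iterable<String> lines() {
    return metricLines.values();
  }

  public static void main(String[] args) {
    TaggedMetricCache cache = new TaggedMetricCache();
    cache.put("queue_metrics_max_capacity",
        Map.of("queue", "root.queue1", "context", "yarn", "hostname", "rm_host1"), 1);
    cache.put("queue_metrics_max_capacity",
        Map.of("queue", "root.queue2", "context", "yarn", "hostname", "rm_host1"), 2);
    cache.lines().forEach(System.out::println); // both queue metrics are printed
  }
}
{code}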
2. RPC metrics for the NameNode
The NameNode may expose RPC metrics for multiple ports (for example, the service RPC port). For the same reason as issue 1, some RPC metrics are lost when PrometheusSink is used.
{code:java}
rpc_rpc_queue_time300s90th_percentile_latency{port="9000",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
// rpc port=9005 metric can't be exported
rpc_rpc_queue_time300s90th_percentile_latency{port="9005",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
{code}
3. TopMetrics for the NameNode
org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics is a special metric; it is essentially a Summary metric type. Its record names vary with the user and op, so new entries accumulate in PrometheusMetricsSink$metricLines indefinitely, which risks a memory leak. It needs special handling.
{code:java}
// invalid TopMetrics export
# TYPE nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count counter
nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10

// after applying this patch
# TYPE nn_top_user_op_counts_window_ms_1500000_count counter
nn_top_user_op_counts_window_ms_1500000_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10
{code}
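For reference, a rough sketch of the kind of name normalization the exported lines above suggest: the op/user parts move out of the metric name so the set of exported names stays bounded. The pattern and class name here are assumptions for illustration, not the actual patch; in the real sink the op and user labels would presumably come from the record's tags rather than be parsed out of the name.
{code:java}
// Illustrative sketch only: collapse the per-op / per-user suffix of a TopMetrics
// record name into labels, so PrometheusMetricsSink$metricLines cannot grow without bound.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopMetricsNameNormalizer {
  // Hypothetical pattern for record names shaped like "..._op_<op>_user_<user>_count".
  private static final Pattern TOP_METRIC =
      Pattern.compile("(nn_top_user_op_counts_window_ms_\\d+)_op_(.+)_user_(.+)_count");

  public static String normalize(String recordName) {
    Matcher m = TOP_METRIC.matcher(recordName);
    if (!m.matches()) {
      return recordName; // not a TopMetrics record; leave it as-is
    }
    // op and user become labels instead of being part of the metric name.
    return m.group(1) + "_count{op=\"" + m.group(2) + "\",user=\"" + m.group(3) + "\"}";
  }

  public static void main(String[] args) {
    System.out.println(normalize(
        "nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count"));
    // prints: nn_top_user_op_counts_window_ms_1500000_count{op="safemode_get",user="hadoop_client_ip_test_com"}
  }
}
{code}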
> Improve PrometheusSink for Namenode and ResourceManager Metrics
> ---------------------------------------------------------------
>
> Key: HADOOP-17893
> URL: https://issues.apache.org/jira/browse/HADOOP-17893
> Project: Hadoop Common
> Issue Type: Improvement
> Components: metrics
> Affects Versions: 3.4.0
> Reporter: Max Xie
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]