Max  Xie created HADOOP-17893:
---------------------------------

             Summary: Improve PrometheusSink for Namenode and ResourceManager 
Metrics
                 Key: HADOOP-17893
                 URL: https://issues.apache.org/jira/browse/HADOOP-17893
             Project: Hadoop Common
          Issue Type: Improvement
          Components: metrics
    Affects Versions: 3.4.0
            Reporter: Max  Xie


HADOOP-16398 added an exporter that publishes Hadoop metrics to Prometheus, 
but some metrics are not exported correctly. For example:

1.  queue metrics for ResourceManager
{code:java}
queue_metrics_max_capacity{queue="root.queue1",context="yarn",hostname="rm_host1"} 1
// queue2's metric can't be exported
queue_metrics_max_capacity{queue="root.queue2",context="yarn",hostname="rm_host1"} 2
{code}
Only one queue's metric is ever exported, because 
PrometheusMetricsSink$metricLines caches a single entry per metric name, even 
when the metrics carry different tags.
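
One way to avoid the collision described above is to key the cache by metric 
name and then by tag set. The following is only a minimal sketch under that 
assumption: the class TaggedMetricCacheSketch and its methods are hypothetical, 
not the actual PrometheusMetricsSink code, and label-value escaping is left out.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical sketch of a tag-aware cache; not the real PrometheusMetricsSink. */
class TaggedMetricCacheSketch {
    // metric name -> (rendered label set -> latest value)
    private final Map<String, Map<String, Number>> cache = new TreeMap<>();

    /** Stores a value under its name AND its tags, so same-named series do not overwrite each other. */
    synchronized void put(String name, Map<String, String> tags, Number value) {
        // Sort the tags so the same tag set always renders to the same key.
        String labels = new TreeMap<>(tags).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
        cache.computeIfAbsent(name, k -> new TreeMap<>()).put(labels, value);
    }

    /** Renders every cached series, one line per distinct (name, tags) pair. */
    synchronized String render() {
        StringBuilder sb = new StringBuilder();
        cache.forEach((name, series) ->
                series.forEach((labels, value) ->
                        sb.append(name).append(labels).append(' ').append(value).append('\n')));
        return sb.toString();
    }

    public static void main(String[] args) {
        TaggedMetricCacheSketch sink = new TaggedMetricCacheSketch();
        sink.put("queue_metrics_max_capacity", Map.of("queue", "root.queue1", "context", "yarn"), 1);
        sink.put("queue_metrics_max_capacity", Map.of("queue", "root.queue2", "context", "yarn"), 2);
        System.out.print(sink.render()); // both queues appear
    }
}
```

With this layout, a second put for the same name and tags still replaces the 
older value, while root.queue1 and root.queue2 are kept as separate series.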

 

2. rpc metrics for Namenode

The Namenode may expose RPC metrics on multiple ports, such as the service RPC 
port. For the same reason as in issue 1, some of these RPC metrics are lost 
when PrometheusSink is used.
{code:java}
rpc_rpc_queue_time300s90th_percentile_latency{port="9005",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
rpc_rpc_queue_time300s90th_percentile_latency{port="9000",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
{code}
3. TopMetrics for Namenode

org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics is a special 
metric; I think it is essentially a Summary metric type. The TopMetrics record 
name varies with the user and op, which means these metrics stay in 
PrometheusMetricsSink$metricLines indefinitely and risk a memory leak. We need 
to treat it specially.
{code:java}
// invalid TopMetrics export
# TYPE nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count counter
nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10

// after applying this patch
# TYPE nn_top_user_op_counts_window_ms_1500000_count counter
nn_top_user_op_counts_window_ms_1500000_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10
{code}
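
The renaming shown in the two exports above could be done by stripping the 
per-op and per-user part from the record name and relying on the op and user 
tags instead. The following is a rough sketch only: the class 
TopMetricNameSketch, its regular expression, and the assumption that such names 
always end in _count are illustrative and not taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of TopMetrics name normalization; not the patch itself. */
class TopMetricNameSketch {
    // Assumed name shape: <prefix incl. window>_op_<op and user parts>_count
    private static final Pattern TOP_METRIC =
            Pattern.compile("^(nn_top_user_op_counts_window_ms_\\d+)_op_.*_count$");

    /** Collapses a per-op/per-user TopMetrics name to one stable name; other names pass through. */
    static String normalize(String metricName) {
        Matcher m = TOP_METRIC.matcher(metricName);
        return m.matches() ? m.group(1) + "_count" : metricName;
    }

    public static void main(String[] args) {
        System.out.println(normalize(
                "nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count"));
    }
}
```

Because the name no longer grows with every new user and op combination, the 
number of distinct metric names stays bounded; the distinguishing information 
remains available in the op and user tags.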