Max Xie created HADOOP-17893:
---------------------------------
Summary: Improve PrometheusSink for Namenode and ResourceManager Metrics
Key: HADOOP-17893
URL: https://issues.apache.org/jira/browse/HADOOP-17893
Project: Hadoop Common
Issue Type: Improvement
Components: metrics
Affects Versions: 3.4.0
Reporter: Max Xie
HADOOP-16398 added an exporter that publishes Hadoop metrics to Prometheus,
but some metrics are not exported correctly. For example:
1. queue metrics for ResourceManager
{code:java}
queue_metrics_max_capacity{queue="root.queue1",context="yarn",hostname="rm_host1"} 1
// queue2's metric can't be exported
queue_metrics_max_capacity{queue="root.queue2",context="yarn",hostname="rm_host1"} 2
{code}
Only one queue's metric is ever exported, because
PrometheusMetricsSink$metricLines caches a single entry per metric name, even
when metrics with the same name carry different tags.
2. rpc metrics for Namenode
The Namenode may expose RPC metrics on multiple ports, for example the
service RPC port. For the same reason as in issue 1, some of these RPC metrics
are lost when PrometheusSink is used.
{code:java}
rpc_rpc_queue_time300s90th_percentile_latency{port="9005",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
rpc_rpc_queue_time300s90th_percentile_latency{port="9000",servername="ClientNamenodeProtocol",context="rpc",hostname="nnhost"} 0
{code}
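To illustrate issues 1 and 2, here is a minimal sketch of the difference between caching by metric name only and caching by name plus tags. It is not the actual PrometheusMetricsSink code: the class `MetricCacheSketch`, its fields, and its methods are hypothetical names made up for this example, and the real sink may key its cache differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the actual PrometheusMetricsSink implementation. */
public class MetricCacheSketch {

    /** Keyed by metric name only: a later sample with the same name replaces the earlier one. */
    final Map<String, String> byName = new LinkedHashMap<>();

    /** Keyed by metric name, then by the sorted tag set: samples with different tags coexist. */
    final Map<String, Map<String, String>> byNameAndTags = new LinkedHashMap<>();

    /** Renders one exposition line, with tags sorted by key for a stable output. */
    static String format(String name, Map<String, String> tags, Number value) {
        StringBuilder sb = new StringBuilder(name).append('{');
        String sep = "";
        for (Map.Entry<String, String> e : new TreeMap<>(tags).entrySet()) {
            sb.append(sep).append(e.getKey()).append("=\"").append(e.getValue()).append('"');
            sep = ",";
        }
        return sb.append("} ").append(value).toString();
    }

    /** Stores the same sample in both caches so their behavior can be compared. */
    void put(String name, Map<String, String> tags, Number value) {
        String line = format(name, tags, value);
        byName.put(name, line);
        byNameAndTags
            .computeIfAbsent(name, k -> new LinkedHashMap<>())
            .put(new TreeMap<>(tags).toString(), line);
    }

    public static void main(String[] args) {
        MetricCacheSketch cache = new MetricCacheSketch();
        cache.put("queue_metrics_max_capacity", Map.of("queue", "root.queue1"), 1);
        cache.put("queue_metrics_max_capacity", Map.of("queue", "root.queue2"), 2);

        // Name-only cache keeps just the last queue; the tag-aware cache keeps both.
        System.out.println(cache.byName.size());
        System.out.println(cache.byNameAndTags.get("queue_metrics_max_capacity").size());
    }
}
```

The same keying would also keep the two `port` variants of the RPC metric apart, since they differ only in their tags.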
3. TopMetrics for Namenode
org.apache.hadoop.hdfs.server.namenode.top.metrics.TopMetrics is a special
case; I think it is essentially a Summary metric type. The TopMetrics record
name varies with the user and op, so an entry for every user/op combination
stays in PrometheusMetricsSink$metricLines indefinitely, which risks a memory
leak. We need to treat it specially.
{code:java}
// invalid TopMetrics export
# TYPE nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count counter
nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10
// after applying this patch
# TYPE nn_top_user_op_counts_window_ms_1500000_count counter
nn_top_user_op_counts_window_ms_1500000_count{context="dfs",hostname="nn_host",op="safemode_get",user="hadoop/[email protected]"} 10
{code}
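As a rough sketch of the renaming shown above, the op and user parts could be stripped from the metric name so that they survive only as tags. The class `TopMetricNameSketch`, the method `normalize`, and the regular expression are assumptions inferred from the example names in this issue; they are not taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; the pattern is inferred from the example names in this issue. */
public class TopMetricNameSketch {

    // Assumed shape: <prefix with window>_op_<op and user parts>_count
    private static final Pattern TOP_METRIC =
        Pattern.compile("^(nn_top_user_op_counts_window_ms_\\d+)_op_.*_count$");

    /** Collapses a per-user, per-op TopMetrics name into one name per window. */
    static String normalize(String name) {
        Matcher m = TOP_METRIC.matcher(name);
        // Names that do not look like TopMetrics are returned unchanged.
        return m.matches() ? m.group(1) + "_count" : name;
    }

    public static void main(String[] args) {
        System.out.println(normalize(
            "nn_top_user_op_counts_window_ms_1500000_op_safemode_get_user_hadoop_client_ip_test_com_count"));
        System.out.println(normalize("queue_metrics_max_capacity"));
    }
}
```

With the name collapsed this way, the number of distinct metric names is bounded by the number of windows rather than by the number of user/op combinations.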