[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring

ASF GitHub Bot (Jira) Thu, 22 Jun 2023 17:00:35 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736312#comment-17736312
 ]


ASF GitHub Bot commented on HDFS-17055:
---------------------------------------

xinglin commented on code in PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#discussion_r1239137137


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java:
##########
@@ -2051,6 +2056,26 @@ synchronized HAServiceState getServiceState() {
     return state.getServiceState();
   }
 
+  /**
+   * Emit Namenode HA service state as an integer so that one can monitor NN HA
+   * state based on this metric.
+   *
+   * @return  0 when not fully started
+   *          1 for active or standalone (non-HA) NN
+   *          2 for standby
+   *          3 for observer
+   *

Review Comment:
   Searching the codebase, it seems the STOPPING state is only ever set in 
YARN ResourceManager HA. HDFS does not use that state.
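
For illustration, the mapping described in the Javadoc above could be sketched as follows. This is a minimal, self-contained sketch and not the code in the PR: the class `HaStateMetricSketch`, its nested enum `HaState` (a stand-in for Hadoop's `HAServiceState`), and the method `toMetricValue` are hypothetical names. Reporting 0 for STOPPING or a null state is an assumption, made on the basis that HDFS never sets STOPPING.

```java
/** Hypothetical sketch of the HA-state-to-integer mapping; not the PR's code. */
final class HaStateMetricSketch {

  /** Stand-in for Hadoop's HAServiceState, declared here to keep the sketch self-contained. */
  enum HaState { INITIALIZING, ACTIVE, STANDBY, OBSERVER, STOPPING }

  private HaStateMetricSketch() {
  }

  /**
   * Maps an HA state to the integer documented in the Javadoc:
   * 0 when not fully started, 1 for active (or standalone, non-HA),
   * 2 for standby, 3 for observer.
   */
  static int toMetricValue(HaState state) {
    if (state == null) {
      // Assumption: a state that has not been set yet counts as "not fully started".
      return 0;
    }
    switch (state) {
      case ACTIVE:
        return 1;
      case STANDBY:
        return 2;
      case OBSERVER:
        return 3;
      default:
        // INITIALIZING, plus STOPPING (assumed to be unused in HDFS).
        return 0;
    }
  }
}
```

With this sketch, `HaStateMetricSketch.toMetricValue(HaStateMetricSketch.HaState.STANDBY)` evaluates to 2, and any state outside the three serving states falls back to 0.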





> Export HAState as a metric from Namenode for monitoring
> -------------------------------------------------------
>
>                 Key: HDFS-17055
>                 URL: https://issues.apache.org/jira/browse/HDFS-17055
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.4.0, 3.3.9
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Minor
>              Labels: pull-request-available
>
> We'd like to measure the uptime of Namenodes: the percentage of time during 
> which the active/standby/observer node is available (up and running). We could 
> monitor the namenode from an external service, such as ZKFC. But that would 
> require the external service itself to be available 100% of the time. When 
> this third-party external monitoring service is down, we have no information 
> on whether our Namenodes are still up.
> We propose to take a different approach: we will emit Namenode state directly 
> from the namenode itself. Whenever we miss a data point for this metric, we 
> consider the corresponding namenode to be down/not available. In other words, 
> we assume the metric collection/monitoring infrastructure to be 100% reliable.
> One implementation detail: in Hadoop, we have the _NameNodeMetrics_ class, 
> which is currently used to emit all metrics for _NameNode.java_. However, 
> we don't think that is a good place to emit NameNode HAState. HAState is 
> stored in NameNode.java and we should emit it directly from NameNode.java. 
> Otherwise, we would duplicate this info in two classes and have 
> to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a 
> reference to the _NameNode_ object it belongs to: a _NameNodeMetrics_ instance 
> is created by the _static_ function _initMetrics()_ in _NameNode.java_.
> We shouldn't emit HA state from FSNamesystem.java either, as it is 
> initialized from NameNode.java and all state transitions are implemented in 
> NameNode.java.
>  
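
The "missing data point means down" rule in the description can be turned into an uptime figure roughly as follows. This is only a sketch under stated assumptions: the class `UptimeSketch`, the method `availability`, and the fixed reporting interval are all hypothetical and do not come from the issue or from Hadoop. It splits a time window into reporting slots and counts the fraction of slots that contain at least one sample.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: derive availability from the presence of metric samples. */
final class UptimeSketch {

  private UptimeSketch() {
  }

  /**
   * Returns the fraction of reporting slots in [windowStartSec, windowEndSec)
   * that contain at least one sample. A slot without a sample is treated as
   * "namenode down", which assumes the metrics pipeline itself never drops data.
   */
  static double availability(long[] sampleTimesSec, long windowStartSec,
      long windowEndSec, long intervalSec) {
    if (intervalSec <= 0 || windowEndSec <= windowStartSec) {
      throw new IllegalArgumentException("invalid window or interval");
    }
    // Number of slots, rounded up so a partial trailing slot still counts.
    long expectedSlots =
        (windowEndSec - windowStartSec + intervalSec - 1) / intervalSec;
    Set<Long> seenSlots = new HashSet<>();
    for (long t : sampleTimesSec) {
      if (t >= windowStartSec && t < windowEndSec) {
        seenSlots.add((t - windowStartSec) / intervalSec);
      }
    }
    return (double) seenSlots.size() / expectedSlots;
  }
}
```

For example, samples at 0, 60 and 180 seconds in a 240-second window with a 60-second interval cover three of four slots, so the sketch reports an availability of 0.75.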


