xinglin opened a new pull request, #5764:
URL: https://github.com/apache/hadoop/pull/5764

   
   
   ### Description of PR
   We'd like to measure the uptime of Namenodes: the percentage of time during 
which the active/standby/observer node is available (up and running). We could 
monitor the Namenode from an external service, such as ZKFC, but that would 
require the external service itself to be available 100% of the time. When 
that external monitoring service is down, we would have no information on 
whether our Namenodes are still up.
   
   We propose a different approach: emit the Namenode state directly from the 
Namenode itself. Whenever a data point for this metric is missing, we consider 
the corresponding Namenode to be down/not available. In other words, we assume 
the metric collection/monitoring infrastructure to be 100% reliable.
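
   As a rough illustration of how uptime could be derived from such a metric, 
the sketch below splits a time window into fixed reporting intervals and counts 
an interval as "up" only if at least one data point landed in it. The class 
name, method name, and the fixed-interval assumption are all hypothetical; the 
PR does not specify how the uptime percentage is computed downstream.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the patch.
public class UptimeEstimator {

  /**
   * Returns the fraction (0.0 to 1.0) of reporting intervals in
   * [windowStartMs, windowEndMs) that contain at least one data point.
   * An interval without a data point is treated as downtime.
   */
  public static double uptimeFraction(List<Long> sampleTimesMs,
      long windowStartMs, long windowEndMs, long intervalMs) {
    if (intervalMs <= 0 || windowEndMs <= windowStartMs) {
      throw new IllegalArgumentException("invalid window or interval");
    }
    // Number of intervals in the window, rounding a partial last one up.
    long totalSlots =
        (windowEndMs - windowStartMs + intervalMs - 1) / intervalMs;
    Set<Long> coveredSlots = new HashSet<>();
    for (long t : sampleTimesMs) {
      if (t >= windowStartMs && t < windowEndMs) {
        coveredSlots.add((t - windowStartMs) / intervalMs);
      }
    }
    return (double) coveredSlots.size() / totalSlots;
  }

  public static void main(String[] args) {
    // Four 10-second intervals; no data point arrived in the third one.
    List<Long> samples = Arrays.asList(1_000L, 12_000L, 35_000L);
    System.out.println(uptimeFraction(samples, 0L, 40_000L, 10_000L)); // 0.75
  }
}
```

   Note that a gap in the data is attributed to the Namenode even if the 
collection pipeline dropped the point, which is why the approach assumes a 
reliable metrics infrastructure.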
   
   One implementation detail: in Hadoop, the NameNodeMetrics class is 
currently used to emit all metrics for NameNode.java. However, we don't think 
it is a good place to emit the NameNode HAState. HAState is stored in 
NameNode.java, so we should emit it directly from NameNode.java. Otherwise, we 
would duplicate this info in two classes and have to keep them in sync. 
Besides, the NameNodeMetrics class does not hold a reference to the NameNode 
object it belongs to: a NameNodeMetrics instance is created by the static 
function initMetrics() in NameNode.java.
   
   We shouldn't emit the HA state from FSNamesystem.java either, as 
FSNamesystem is initialized from NameNode.java and all state transitions are 
implemented in NameNode.java.
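
   The design argument above can be sketched in plain Java, without the Hadoop 
metrics library: the class that owns the state exposes a gauge computed from 
its own field, so a collector never holds a second copy that could drift out 
of sync. Every name here (`Node`, `HaState`, `haStateGauge`, `snapshot`) and 
the integer encoding of the states are assumptions made for illustration, not 
the code or the encoding used by the patch.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not the patch itself.
public class HaStateGaugeSketch {

  // Hypothetical stand-in for the HA states discussed above.
  enum HaState { INITIALIZING, ACTIVE, STANDBY, OBSERVER }

  // Hypothetical stand-in for the class that owns the HA state.
  static class Node {
    private volatile HaState state = HaState.INITIALIZING;

    void transitionTo(HaState next) {
      state = next;
    }

    // Gauge value computed from the owner's own field on every read,
    // so there is no second copy of the state to keep in sync.
    int haStateGauge() {
      return state.ordinal();
    }
  }

  // Hypothetical collector: it only holds a supplier, never a cached value.
  public static int snapshot(IntSupplier gauge) {
    return gauge.getAsInt();
  }

  // Runs a short scenario and returns the values a collector would record.
  public static int[] demo() {
    Node node = new Node();
    int before = snapshot(node::haStateGauge);
    node.transitionTo(HaState.ACTIVE);
    int after = snapshot(node::haStateGauge);
    return new int[] {before, after};
  }

  public static void main(String[] args) {
    int[] values = demo();
    System.out.println(values[0] + " -> " + values[1]); // prints "0 -> 1"
  }
}
```

   Because the gauge is read on demand, a state transition is visible to the 
next snapshot without any extra bookkeeping in a separate metrics class.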
   
   ### How was this patch tested?
   mvn test -Dtest="TestHAMetrics"
   ```
   [INFO] -------------------------------------------------------
   [INFO]  T E S T S
   [INFO] -------------------------------------------------------
   [INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestHAMetrics
   [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.907 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestHAMetrics
   [INFO]
   [INFO] Results:
   [INFO]
   [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

