[
https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736547#comment-17736547
]
ASF GitHub Bot commented on HDFS-17055:
---------------------------------------
xinglin commented on PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#issuecomment-1604512082
The TestObserverNode unit test failure does not seem to be related to the
change in this PR. The error is an RPC connection error.
```
[ERROR] testMkdirsRaceWithObserverRead(org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode) Time elapsed: 317.089 s <<< ERROR!
java.net.ConnectException: Call From 038ad877ed75/172.17.0.2 to localhost:11836 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor115.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:948)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:863)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1588)
    at org.apache.hadoop.ipc.Client.call(Client.java:1529)
    at org.apache.hadoop.ipc.Client.call(Client.java:1426)
```
All unit tests in TestObserverNode also passed when run on my laptop.
```
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
[INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 49.418 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestObserverNode
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 20, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:27 min
[INFO] Finished at: 2023-06-23T09:18:26-07:00
```
Triggering another build.
> Export HAState as a metric from Namenode for monitoring
> -------------------------------------------------------
>
> Key: HDFS-17055
> URL: https://issues.apache.org/jira/browse/HDFS-17055
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 3.4.0, 3.3.9
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> We'd like to measure the uptime of Namenodes: the percentage of time during
> which the active/standby/observer node is available (up and running). We could
> monitor the namenode from an external service, such as ZKFC, but that would
> require the external service itself to be available 100% of the time. When
> this third-party external monitoring service is down, we won't have info on
> whether our Namenodes are still up.
> We propose to take a different approach: we will emit the Namenode state
> directly from the namenode itself. Whenever we miss a data point for this
> metric, we consider the corresponding namenode to be down/not available. In
> other words, we assume the metric collection/monitoring infrastructure to be
> 100% reliable.
> One implementation detail: in hadoop, we have the _NameNodeMetrics_ class,
> which is currently used to emit all metrics for _NameNode.java_. However,
> we don't think that is a good place to emit the NameNode HAState. HAState is
> stored in NameNode.java and we should emit it directly from NameNode.java.
> Otherwise, we would duplicate this info in two classes and would have
> to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a
> reference to the _NameNode_ object it belongs to. A _NameNodeMetrics_ instance
> is created by a _static_ function _initMetrics()_ in _NameNode.java_.
> We shouldn't emit HA state from FSNamesystem.java either, as it is
> initialized from NameNode.java and all state transitions are implemented in
> NameNode.java.
>
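For illustration only, the approach in the issue description above can be sketched in plain Java. This is not the code from the PR and it does not use Hadoop's metrics classes: `HaState`, `NodeSketch`, `UptimeSketch`, and the numeric state codes are all hypothetical names and values chosen for this sketch. It assumes the collector takes one sample per interval and records a missing data point as `null`.

```java
import java.util.List;

/** Hypothetical HA states with made-up numeric codes (not Hadoop's real enum). */
enum HaState {
    INITIALIZING(0), ACTIVE(1), STANDBY(2), OBSERVER(3);

    final int code;

    HaState(int code) {
        this.code = code;
    }
}

/** Hypothetical stand-in for a NameNode that owns its HA state and reports it itself. */
final class NodeSketch {
    // Single source of truth: the state lives here and is not copied elsewhere.
    private volatile HaState state = HaState.INITIALIZING;

    void transitionTo(HaState next) {
        state = next;
    }

    /** Gauge value a metrics collector would read on each collection interval. */
    int haStateGauge() {
        return state.code;
    }
}

/** Hypothetical uptime math over collected samples; null means a missed data point. */
final class UptimeSketch {
    /** Fraction of intervals with any data point; a missing point counts as "down". */
    static double uptime(List<Integer> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long up = samples.stream().filter(s -> s != null).count();
        return (double) up / samples.size();
    }

    /** Fraction of intervals in which the node reported the given state. */
    static double fractionInState(List<Integer> samples, HaState wanted) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long hits = samples.stream()
                .filter(s -> s != null && s == wanted.code)
                .count();
        return (double) hits / samples.size();
    }
}
```

Because the gauge reads the state field directly, there is no second copy to keep in sync, which is the concern the description raises about emitting the value from a separate metrics class.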