[ 
https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683194#comment-17683194
 ] 

ASF GitHub Bot commented on HDFS-16902:
---------------------------------------

virajjasani commented on code in PR #5334:
URL: https://github.com/apache/hadoop/pull/5334#discussion_r1093941288


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java:
##########
@@ -202,6 +202,7 @@ private String getNameNodeAddress() {
   Map<String, String> getActorInfoMap() {
     final Map<String, String> info = new HashMap<String, String>();
     info.put("NamenodeAddress", getNameNodeAddress());
+    info.put("NamenodeState", state.toString());

Review Comment:
   Actually, that is better, since that is the main thing we want to debug. 
Let me make this change.





> Add Namenode status to BPServiceActor metrics and improve logging in 
> offerservice
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16902
>                 URL: https://issues.apache.org/jira/browse/HDFS-16902
>             Project: Hadoop HDFS
>          Issue Type: Task
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>
> Recently came across a k8s environment where some datanode pods randomly 
> fail to stay connected to all namenode pods (e.g. the last heartbeat time 
> sometimes stays higher than 2 hr). When any standby namenode becomes active, 
> any datanode that has not been heartbeating to it for quite some time would 
> not be able to send any further block reports, leading to missing replicas 
> immediately after namenode failover, which could only be resolved with a 
> datanode pod restart.
> While the issue seems environment-specific, BPServiceActor's offer service 
> could use some logging improvements. It would also be good to expose the 
> namenode status through BPServiceActorInfo to identify any lag on the 
> datanode side in recognizing the updated Active namenode status via 
> heartbeats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
