[ https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103278#comment-14103278 ]

Zesheng Wu commented on HDFS-6827:
----------------------------------

bq. As I described in the issue description, our NN came up inside about 6 seconds. 1 second for Client#handleConnectionFailure() sleep, the other 5 seconds for some unknown reasons, maybe GC or network problems, we haven't found direct evidences.

Sorry, that description was not very accurate. Our NN came up within about 6 seconds, and the ZKFC retried the connection just after the NN had started successfully. About 6 seconds elapsed between the ZKFC detecting 'Connection reset by peer' and reconnecting to the NN successfully. 1 second of that is definitely the {{Client#handleConnectionFailure()}} sleep; the other 5 seconds are unexplained, possibly GC or network problems, but we have found no direct evidence.

> Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the
> target's status changing sometimes
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6827
>                 URL: https://issues.apache.org/jira/browse/HDFS-6827
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.1
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>            Priority: Critical
>         Attachments: HDFS-6827.1.patch
>
>
> In our production cluster, we encountered the following scenario: the ANN crashed due to a write journal timeout and was restarted automatically by the watchdog, but after the restart both NNs were standby.
> The following logs show the scenario:
> # NN1 is down due to write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> # ZKFC1 detected "connection reset by peer":
> {color:red}2014-08-03,23:02:02,560{color} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:xx@xx.HADOOP (auth:KERBEROS) cause:java.io.IOException: {color:red}Connection reset by peer{color}
> # NN1 was restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: IPC Server listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Registered DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
> # ZKFC1 retried the connection and considered NN1 healthy:
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
> # ZKFC1 still considered NN1 a healthy Active NN and didn't trigger a failover; as a result, both NNs were standby.
> The root cause of this bug is that the NN is restarted too quickly and the ZKFC health monitor doesn't notice the restart.
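
To make the timing in the comment concrete, here is a minimal sketch that recomputes the intervals from the timestamps quoted in the logs above. The class name {{RestartTimeline}} and the helper {{millisBetween}} are hypothetical names chosen for illustration, not part of Hadoop, and the arithmetic assumes the NN and ZKFC logs share a synchronized clock.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not part of Hadoop.
public class RestartTimeline {
    // Timestamp layout used in the quoted log lines: date,time,millis.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd,HH:mm:ss,SSS");

    // Milliseconds elapsed between two log timestamps.
    public static long millisBetween(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, LOG_TS),
                                LocalDateTime.parse(to, LOG_TS)).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from the logs in the issue description.
        String resetSeen   = "2014-08-03,23:02:02,560"; // ZKFC1: Connection reset by peer
        String ipcListener = "2014-08-03,23:02:07,884"; // NN1: IPC Server listener starting
        String zkfcRetry   = "2014-08-03,23:02:08,292"; // ZKFC1: Retrying connect to server

        // prints "reset -> retry: 5732 ms"
        System.out.println("reset -> retry: "
                + millisBetween(resetSeen, zkfcRetry) + " ms");
        // prints "listener -> retry: 408 ms"
        System.out.println("listener -> retry: "
                + millisBetween(ipcListener, zkfcRetry) + " ms");
    }
}
```

Under that clock assumption, the roughly 5.7 seconds between the reset and the retry matches the "about 6 seconds" described in the comment, and the retry arrives about 0.4 seconds after the restarted NN's IPC listener starts, which is consistent with the ZKFC reconnecting to the new NN process without observing an unhealthy state.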