Zesheng Wu created HDFS-6827:
--------------------------------

             Summary: NameNode double standby
                 Key: HDFS-6827
                 URL: https://issues.apache.org/jira/browse/HDFS-6827
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha
    Affects Versions: 2.4.1
            Reporter: Zesheng Wu
            Assignee: Zesheng Wu
In our production cluster, we encountered the following scenario: the ANN crashed due to a write-journal timeout and was restarted automatically by the watchdog, but after the restart both NNs were standby. Here are the logs of the scenario:

# NN1 went down due to a write-journal timeout:
{color:red}2014-08-03,23:02:02,219{color} INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
# ZKFC1 detected "connection reset by peer":
{color:red}2014-08-03,23:02:02,560{color} ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:xx@xx.HADOOP (auth:KERBEROS) cause:java.io.IOException: {color:red}Connection reset by peer{color}
# NN1 was restarted successfully by the watchdog:
2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at: xx:13201
2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
{color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: IPC Server listener on 13200: starting
2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean thread started!
2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Registered DFSClientInformation MBean
2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode up at: xx/xx:13200
2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
# ZKFC1 retried the connection and concluded that NN1 was healthy:
{color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
# ZKFC1 therefore still considered NN1 to be a healthy active NN and did not trigger a failover. As a result, both NNs were standby.

The root cause of this bug is that the NN is restarted so quickly that the ZKFC health monitor never notices it went down.
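To illustrate the timing problem, here is a minimal sketch (not the actual ZKFC HealthMonitor code; the HealthProbe and StateListener interfaces and the RetryMaskingSketch class are made up for illustration) of a health check wrapped in a fixed-sleep retry. If the monitored process crashes and is restarted within the retry window, the monitor never observes an unhealthy state, so no state-change callback fires and no failover is triggered:

{code:java}
import java.io.IOException;

public class RetryMaskingSketch {

    /** Hypothetical stand-in for the NameNode health RPC. */
    interface HealthProbe {
        void checkHealth() throws IOException;
    }

    /** Hypothetical callback, fired only when the observed health state changes. */
    interface StateListener {
        void onStateChange(boolean healthy);
    }

    // Mirrors the retry policy seen in the log:
    // RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
    static final int MAX_RETRIES = 1;
    static final long RETRY_SLEEP_MS = 1000L;

    private boolean lastObservedHealthy = true;

    /** One round of health monitoring with a fixed-sleep retry. */
    void monitorOnce(HealthProbe probe, StateListener listener)
            throws InterruptedException {
        boolean healthy = false;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                probe.checkHealth();
                // If the watchdog restarted the NN inside the ~1s retry window,
                // this retry succeeds and the earlier failure is swallowed.
                healthy = true;
                break;
            } catch (IOException e) {
                if (attempt < MAX_RETRIES) {
                    Thread.sleep(RETRY_SLEEP_MS);
                }
            }
        }
        // The monitor only reacts to *transitions*. Because the quick restart
        // is masked, "healthy" never changes, onStateChange never fires, and
        // no failover is initiated even though NN1 came back up as standby.
        if (healthy != lastObservedHealthy) {
            listener.onStateChange(healthy);
            lastObservedHealthy = healthy;
        }
    }
}
{code}

In the logs above the retry is performed inside org.apache.hadoop.ipc.Client, beneath the health check, so the failure never surfaces to the ZKFC at all and its view of NN1 stays "healthy active" even though NN1 restarted into standby.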