[
https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101725#comment-14101725
]
Zesheng Wu commented on HDFS-6827:
----------------------------------
Thanks [~stack].
bq. This issue looks a little ugly. NameNodes stuck in standby mode? This
production? What did it look like?
Yes, both NameNodes were stuck in standby mode, and the HBase cluster on top
of it couldn't read or write any more.
We can reproduce the issue in the following way:
1. Increase the sleep time in {{Client#handleConnectionFailure()}}
{code}
try {
  Thread.sleep(action.delayMillis); // default is 1s, can change to 10s or longer
} catch (InterruptedException e) {
  throw (IOException)new InterruptedIOException("Interrupted: action="
      + action + ", retry policy=" + connectionRetryPolicy).initCause(e);
}
{code}
2. Restart the active NameNode quickly, ensuring that it starts
successfully before ZKFC retries the connection.
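To make the race concrete, here is a minimal standalone sketch (not Hadoop code; the {{Target}} class and its methods are hypothetical stand-ins) showing why a fixed-sleep retry can mask a restart: if the process comes back up inside the retry window, the second probe succeeds and the monitor never observes the failure.

```java
// Hypothetical simulation of the HealthMonitor race (not Hadoop code).
public class RetryRaceDemo {
    // Stand-in for a NameNode that has just crashed and is restarted later.
    static class Target {
        private volatile boolean up = false;
        void restart() { up = true; }      // watchdog brings the process back
        boolean ping() { return up; }      // health probe
    }

    public static void main(String[] args) throws Exception {
        Target nn = new Target();          // NN just crashed: probe fails
        boolean healthy = nn.ping();       // first check -> unhealthy
        if (!healthy) {
            Thread.sleep(100);             // retry sleep (1s by default, per the repro)
            nn.restart();                  // restart completes inside the window
            healthy = nn.ping();           // retry succeeds -> looks healthy
        }
        // The monitor reports healthy, so no failover is triggered,
        // even though the target came back in standby state.
        System.out.println(healthy ? "healthy" : "unhealthy"); // prints "healthy"
    }
}
```

With a longer sleep (step 1 above), the window is wide enough that a fast restart reliably lands inside it, which is why the bug reproduces.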
> Both NameNodes could be in STANDBY State due to HealthMonitor not aware of
> the target's status changing sometimes
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-6827
> URL: https://issues.apache.org/jira/browse/HDFS-6827
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.4.1
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
> Priority: Critical
> Attachments: HDFS-6827.1.patch
>
>
> In our production cluster, we encountered a scenario like this: the ANN crashed
> due to a write journal timeout and was restarted by the watchdog automatically,
> but after the restart both NNs were in standby.
> Following is the logs of the scenario:
> # NN1 is down due to write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> # ZKFC1 detected "connection reset by peer"
> {color:red}2014-08-03,23:02:02,560{color} ERROR
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> as:[email protected] (auth:KERBEROS) cause:java.io.IOException:
> {color:red}Connection reset by peer{color}
> # NN1 was restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> Web-server up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server
> Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server:
> IPC Server listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean
> thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> Registered DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> NameNode up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services
> required for standby state
> # ZKFC1 retried the connection and considered NN1 healthy
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client:
> Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1
> SECONDS)
> # ZKFC1 still considered NN1 a healthy active NN and didn't trigger a
> failover; as a result, both NNs ended up in standby.
> The root cause of this bug is that the NN is restarted too quickly for the
> ZKFC health monitor to notice that it went down.
--
This message was sent by Atlassian JIRA
(v6.2#6252)