[
https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103515#comment-14103515
]
Vinayakumar B commented on HDFS-6827:
-------------------------------------
bq. The root cause of this issue is that ANN's ZKFC isn't aware that ANN is
restarted and doesn't trigger failover.
This is fine. After HADOOP-10251, ZKFC will trigger failover based on the state
check from the NameNode.
For example: the Active ZKFC expects its NN to be in ACTIVE state, but after a
restart of the NN it is in STANDBY (the ZKFC need not know the NN was
restarted, and the NN does not automatically come back as ACTIVE). So the ZKFC
quits the election, and a re-election will choose the new Active.
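The decision described above can be sketched as follows. This is a minimal illustration with hypothetical names (not Hadoop's actual `ZKFailoverController` classes): the active ZKFC compares the state it expects its NameNode to be in against the state the NN actually reports, and on a mismatch it quits the election so that a re-election can pick a new Active.

```java
// Hypothetical sketch of the post-HADOOP-10251 state check; names are
// illustrative, not Hadoop's real API.
public class ZkfcStateCheck {
    enum HAState { ACTIVE, STANDBY }

    /** True if the ZKFC should quit the election and let re-election run. */
    static boolean shouldQuitElection(HAState expected, HAState reported) {
        // The ZKFC holding the active lock expects ACTIVE; anything else
        // (e.g. STANDBY after a quick restart) means the lock is stale.
        return expected == HAState.ACTIVE && reported != HAState.ACTIVE;
    }

    public static void main(String[] args) {
        // Before restart: NN reports ACTIVE, ZKFC keeps its lock.
        System.out.println(shouldQuitElection(HAState.ACTIVE, HAState.ACTIVE)); // false
        // After restart: NN comes back in STANDBY, ZKFC quits; re-election runs.
        System.out.println(shouldQuitElection(HAState.ACTIVE, HAState.STANDBY)); // true
    }
}
```

Note this check does not depend on the ZKFC knowing that a restart happened, only on the state the NN reports, which is why it covers the quick-restart case.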
> Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the
> target's status changing sometimes
> --------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-6827
> URL: https://issues.apache.org/jira/browse/HDFS-6827
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.4.1
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
> Priority: Critical
> Attachments: HDFS-6827.1.patch
>
>
> In our production cluster, we encounter a scenario like this: ANN crashed due
> to write journal timeout, and was restarted by the watchdog automatically,
> but after restarting both of the NNs are standby.
> Following is the logs of the scenario:
> # NN1 is down due to write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> # ZKFC1 detected "connection reset by peer"
> {color:red}2014-08-03,23:02:02,560{color} ERROR
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> as:[email protected] (auth:KERBEROS) cause:java.io.IOException:
> {color:red}Connection reset by peer{color}
> # NN1 was restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> Web-server up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server
> Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server:
> IPC Server listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean
> thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> Registered DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> NameNode up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services
> required for standby state
> # ZKFC1 retried the connection and considered NN1 healthy
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client:
> Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1
> SECONDS)
> # ZKFC1 still considered NN1 a healthy Active NN and didn't trigger the
> failover; as a result, both NNs were standby.
> The root cause of this bug is that the NN is restarted too quickly and the
> ZKFC health monitor doesn't realize it.
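The failure mode in the quoted description can be sketched as follows. This is a hypothetical illustration (not Hadoop's actual `HealthMonitor`): the monitor reacts only to health-state transitions, and a NameNode that restarts inside the RPC retry window produces an unbroken HEALTHY sequence, so no transition fires and no failover is triggered.

```java
// Hypothetical sketch of transition-driven health monitoring; names are
// illustrative, not Hadoop's real API.
import java.util.ArrayList;
import java.util.List;

public class TransitionSketch {
    /** Returns the state transitions that would fire monitor callbacks. */
    static List<String> callbacks(List<String> probes) {
        List<String> fired = new ArrayList<>();
        String last = null;
        for (String s : probes) {
            if (last != null && !s.equals(last)) {
                fired.add(last + "->" + s); // only a change triggers action
            }
            last = s;
        }
        return fired;
    }

    public static void main(String[] args) {
        // NN restarted inside the retry window: every probe still succeeds,
        // so no transition fires and the ZKFC never re-evaluates the election.
        System.out.println(callbacks(List.of("HEALTHY", "HEALTHY", "HEALTHY"))); // []
    }
}
```

This is exactly the gap the state check discussed in the comment closes: it consults the NN's reported HA state rather than waiting for a health transition.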
--
This message was sent by Atlassian JIRA
(v6.2#6252)