[
https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101775#comment-14101775
]
Zesheng Wu commented on HDFS-6827:
----------------------------------
bq. BTW: The healthCheckLock is used to distinguish the graceful failover and
the above scenario.
bq. I don't know this code well. How is the above done?
bq. The code in checkServiceStatus seems 'fragile', looking for an explicit
transition. Is there a more explicit check that can be done to learn if the
'service is restarted'?
The solution for this issue is to let the ZKFC learn that the 'service is restarted'.
One straightforward way is to add a field to the {{MonitorHealthResponseProto}}
that identifies a restart of the service; for example, the pid of the NN
process or a generated UUID would satisfy the requirement. Another way is to
let the ZKFC learn that the 'service is restarted' by comparing the service's
current state with its last observed state. We chose the latter: this way the
problem is fixed entirely inside the ZKFC and other services are not affected.
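To illustrate the state-comparison idea, here is a rough sketch (the class and method names are made up for illustration; this is not the actual patch code):
{code:java}
// Illustrative sketch only: detect a likely restart by comparing the last
// observed HA state with the current one, instead of adding a new field to
// MonitorHealthResponseProto. Names below are hypothetical.
enum HAState { INITIALIZING, ACTIVE, STANDBY }

class RestartDetector {
  private HAState lastState;

  /** Returns true if the transition looks like an unexpected restart. */
  boolean serviceLooksRestarted(HAState currentState, boolean healthy) {
    try {
      // A healthy service that was ACTIVE at the last check but reports
      // STANDBY now has most likely been restarted, because a freshly
      // started NN always comes up in standby state.
      return healthy
          && lastState == HAState.ACTIVE
          && currentState == HAState.STANDBY;
    } finally {
      lastState = currentState;
    }
  }
}
{code}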
As we know, the ZKFC supports graceful failover from the command line tool,
and during a graceful failover the ZKFC may encounter a scenario like this: the
last state of the service is Active, the current state is Standby, and the
service is healthy. This looks exactly the same as the buggy scenario described
above, so we must distinguish the two. That is why we add the
{{healthCheckLock}} to suspend the health checking while a graceful failover is
in progress, as in the sketch below.
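A rough sketch of how such a lock could be used (again, purely illustrative; the surrounding class and methods are hypothetical, not the actual patch):
{code:java}
// Illustrative sketch only: hold the lock around a graceful failover so the
// health-check loop cannot observe the intermediate ACTIVE -> STANDBY
// transition and misread it as a restart.
class FailoverCoordinator {
  private final Object healthCheckLock = new Object();

  void gracefulFailover(Runnable doFailover) {
    synchronized (healthCheckLock) {
      // While we hold the lock, the health-check iteration below is blocked,
      // so the ACTIVE -> STANDBY transition caused by this command is never
      // sampled by the health monitor.
      doFailover.run();
    }
  }

  void healthCheckIteration(Runnable checkOnce) {
    synchronized (healthCheckLock) {
      // Runs only when no graceful failover is in progress.
      checkOnce.run();
    }
  }
}
{code}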
Hope I expressed myself clearly:)
> Both NameNodes could be in STANDBY State due to HealthMonitor not aware of
> the target's status changing sometimes
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-6827
> URL: https://issues.apache.org/jira/browse/HDFS-6827
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.4.1
> Reporter: Zesheng Wu
> Assignee: Zesheng Wu
> Priority: Critical
> Attachments: HDFS-6827.1.patch
>
>
> In our production cluster, we encountered a scenario like this: the ANN (active
> NameNode) crashed due to a write-journal timeout and was restarted automatically
> by the watchdog, but after the restart both NNs were standby.
> Following is the logs of the scenario:
> # NN1 is down due to write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> # ZKFC1 detected "connection reset by peer"
> {color:red}2014-08-03,23:02:02,560{color} ERROR
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> as:[email protected] (auth:KERBEROS) cause:java.io.IOException:
> {color:red}Connection reset by peer{color}
> # NN1 was restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> Web-server up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server
> Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server:
> IPC Server listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean
> thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> Registered DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> NameNode up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services
> required for standby state
> # ZKFC1 retried the connection and considered NN1 was healthy
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client:
> Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1
> SECONDS)
> # ZKFC1 still considered NN1 a healthy active NN and didn't trigger a
> failover; as a result, both NNs were standby.
> The root cause of this bug is that the NN is restarted too quickly and the
> ZKFC health monitor doesn't realize it.