[ 
https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103261#comment-14103261
 ] 

Zesheng Wu commented on HDFS-6827:
----------------------------------

Thanks [~stack]:)

bq. Yes. Is there anything more definitive than a check for a particular state 
transition? (Sorry, don't know this area well).
If we want to fix the bug inside ZKFC, there is no other definitive indicator, 
as far as my current knowledge of ZKFC goes. 

bq. This seems less prone to misinterpretation.
Yes, this is more straightforward and less prone to misinterpretation. But 
changing the {{MonitorHealthResponseProto}} proto may introduce an incompatible 
change; if folks think this is acceptable, perhaps we can use this method.
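For illustration only, a minimal sketch of what such a proto change might look like. This is an assumption, not the actual Hadoop definition; the field name and number are made up for the example:

```protobuf
// Hypothetical sketch: extend the health-check response so ZKFC can
// observe the target's HA state directly. Field name/number are
// illustrative assumptions, not Hadoop's real proto.
message MonitorHealthResponseProto {
  // e.g. ACTIVE / STANDBY / INITIALIZING; a restarted NN that came back
  // in standby would then be distinguishable from a continuously-healthy
  // active NN, even if the RPC retry masked the connection failure.
  optional HAServiceStateProto state = 1;
}
```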

bq. Your NN came up inside a second?
As I described in the issue description, our NN came up within about 6 seconds: 
1 second for the {{Client#handleConnectionFailure()}} sleep, and the other 5 
seconds for some unknown reason, maybe GC or network problems; we haven't found 
direct evidence. 

bq. A hacky workaround in meantime would have the NN start sleep first for a 
second?
Yes, we can let the NN sleep for some time before startup. Indeed, we used this 
method to quick-fix the bug in our production cluster temporarily. But as a 
long-term, general solution, we should fix this on the ZKFC side. One more 
thing: ZKFC is a general automatic HA failover framework. It is used by HDFS, 
but not only HDFS; it may be used by any other system that needs automatic HA 
failover. From this perspective, we should fix this inside ZKFC.
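To make the race concrete, here is a toy, self-contained simulation (not Hadoop code; all class and method names are illustrative) of how a fixed-sleep retry policy like {{RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)}} can hide a fast restart from the health monitor:

```java
// Toy simulation of the race described above. All names are made up;
// this only models the timing, not real Hadoop RPC behavior.
public class RestartMaskingDemo {

    // Simulated NN: unreachable between crashMs and restartMs.
    static class SimulatedNameNode {
        final long crashMs, restartMs;
        SimulatedNameNode(long crashMs, long restartMs) {
            this.crashMs = crashMs;
            this.restartMs = restartMs;
        }
        boolean healthyAt(long nowMs) {
            return nowMs < crashMs || nowMs >= restartMs;
        }
    }

    // Mimics a fixed-sleep retry policy: one initial attempt, then up to
    // maxRetries further attempts, each after sleeping sleepMs.
    static boolean checkHealthWithRetry(SimulatedNameNode nn, long startMs,
                                        long sleepMs, int maxRetries) {
        long now = startMs;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (nn.healthyAt(now)) {
                return true; // probe succeeded; monitor sees "healthy"
            }
            now += sleepMs;  // sleep, then retry
        }
        return false;        // all attempts failed; monitor sees "unhealthy"
    }

    public static void main(String[] args) {
        // NN crashes at t=0 and is back at t=500ms (a fast watchdog restart).
        SimulatedNameNode nn = new SimulatedNameNode(0, 500);
        // ZKFC probes at t=100ms: the first attempt fails, but the retry at
        // t=1100ms succeeds, so the monitor never reports unhealthy and no
        // failover is triggered, even though the NN restarted into standby.
        boolean seenHealthy = checkHealthWithRetry(nn, 100, 1000, 1);
        System.out.println("ZKFC verdict: " + (seenHealthy ? "HEALTHY" : "UNHEALTHY"));
    }
}
```

In this model the monitor only ever sees a successful probe, which is why fixing the detection inside ZKFC (or surfacing the NN's state in the health response) is needed rather than relying on connection failures alone.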


> Both NameNodes could be in STANDBY State due to HealthMonitor not aware of 
> the target's status changing sometimes
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6827
>                 URL: https://issues.apache.org/jira/browse/HDFS-6827
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.1
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>            Priority: Critical
>         Attachments: HDFS-6827.1.patch
>
>
> In our production cluster, we encountered a scenario like this: the ANN 
> crashed due to a write journal timeout and was restarted by the watchdog 
> automatically, but after restarting, both of the NNs were standby.
> Following is the logs of the scenario:
> # NN1 went down due to a write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO 
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> # ZKFC1 detected "connection reset by peer"
> {color:red}2014-08-03,23:02:02,560{color} ERROR 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:[email protected] (auth:KERBEROS) cause:java.io.IOException: 
> {color:red}Connection reset by peer{color}
> # NN1 was restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Web-server up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server 
> Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: 
> IPC Server listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean 
> thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Registered DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> NameNode up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for standby state
> # ZKFC1 retried the connection and considered NN1 healthy
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: 
> Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 
> SECONDS)
> # ZKFC1 still considered NN1 a healthy active NN and didn't trigger the 
> failover; as a result, both NNs were standby.
> The root cause of this bug is that the NN was restarted too quickly and the 
> ZKFC health monitor didn't notice the restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
