[jira] [Commented] (HDFS-6827) NameNode double standby

Hadoop QA (JIRA) Wed, 06 Aug 2014 01:59:26 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087438#comment-14087438
 ]


Hadoop QA commented on HDFS-6827:
---------------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12660080/HDFS-6827.1.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/7569//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7569//console

This message is automatically generated.

> NameNode double standby
> -----------------------
>
>                 Key: HDFS-6827
>                 URL: https://issues.apache.org/jira/browse/HDFS-6827
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.4.1
>            Reporter: Zesheng Wu
>            Assignee: Zesheng Wu
>         Attachments: HDFS-6827.1.patch
>
>
> In our production cluster, we encounter a scenario like this: ANN crashed due 
> to write journal timeout, and was restarted by the watchdog automatically, 
> but after restarting both of the NNs are standby.
> Following is the logs of the scenario:
> # NN1 is down due to write journal timeout:
> {color:red}2014-08-03,23:02:02,219{color} INFO 
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> # ZKFC1 detected "connection reset by peer"
> {color:red}2014-08-03,23:02:02,560{color} ERROR 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:[email protected] (auth:KERBEROS) cause:java.io.IOException: 
> {color:red}Connection reset by peer{color}
> # NN1 wat restarted successfully by the watchdog:
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Web-server up at: xx:13201
> 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server 
> Responder: starting
> {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: 
> IPC Server listener on 13200: starting
> 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean 
> thread started!
> 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Registered DFSClientInformation MBean
> 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> NameNode up at: xx/xx:13200
> 2014-08-03,23:02:08,744 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for standby state
> # ZKFC1 retried the connection and considered NN1 was healthy
> {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: 
> Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry 
> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 
> SECONDS)
> # ZKFC1 still considered NN1 as a healthy Active NN, and didn't trigger the 
> failover, as a result, both NNs were standby.
> The root cause of this bug is that NN is restarted too quickly and ZKFC 
> health monitor doesn't realize that.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HDFS-6827) NameNode double standby

Reply via email to