[jira] [Updated] (HDFS-6827) Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the target's status changing sometimes

2014-08-26 Thread Zesheng Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zesheng Wu updated HDFS-6827:
-

Resolution: Duplicate
Status: Resolved  (was: Patch Available)

Duplicate of HADOOP-10251.

 Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the 
 target's status changing sometimes
 --

 Key: HDFS-6827
 URL: https://issues.apache.org/jira/browse/HDFS-6827
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.4.1
Reporter: Zesheng Wu
Assignee: Zesheng Wu
Priority: Critical
 Attachments: HDFS-6827.1.patch


 In our production cluster, we encountered the following scenario: the ANN 
 crashed due to a write-journal timeout and was restarted automatically by the 
 watchdog, but after the restart both NNs were standby.
 Following are the logs for this scenario:
 # NN1 went down due to a write-journal timeout:
 {color:red}2014-08-03,23:02:02,219{color} INFO 
 org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
 # ZKFC1 detected connection reset by peer
 {color:red}2014-08-03,23:02:02,560{color} ERROR 
 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
 as:xx@xx.HADOOP (auth:KERBEROS) cause:java.io.IOException: 
 {color:red}Connection reset by peer{color}
 # NN1 was restarted successfully by the watchdog:
 2014-08-03,23:02:07,884 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 Web-server up at: xx:13201
 2014-08-03,23:02:07,884 INFO org.apache.hadoop.ipc.Server: IPC Server 
 Responder: starting
 {color:red}2014-08-03,23:02:07,884{color} INFO org.apache.hadoop.ipc.Server: 
 IPC Server listener on 13200: starting
 2014-08-03,23:02:08,742 INFO org.apache.hadoop.ipc.Server: RPC server clean 
 thread started!
 2014-08-03,23:02:08,743 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 Registered DFSClientInformation MBean
 2014-08-03,23:02:08,744 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
 NameNode up at: xx/xx:13200
 2014-08-03,23:02:08,744 INFO 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
 required for standby state
 # ZKFC1 retried the connection and considered NN1 healthy:
 {color:red}2014-08-03,23:02:08,292{color} INFO org.apache.hadoop.ipc.Client: 
 Retrying connect to server: xx/xx:13200. Already tried 0 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 
 SECONDS)
 # ZKFC1 still considered NN1 a healthy active NN and did not trigger a 
 failover; as a result, both NNs were standby.
 The root cause of this bug is that the NN is restarted so quickly that the 
 ZKFC health monitor never notices the target's state change (sketched below).
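
 A minimal sketch (not the actual ZKFC/HealthMonitor code) of the gap described 
 above, assuming a proxy to the target NameNode's {{HAServiceProtocol}}: a check 
 that only verifies the health RPC succeeds is satisfied again as soon as the 
 restarted NN is back up, whereas a check that also compares the reported HA 
 state against the last known state would notice that the node believed to be 
 active now reports standby. Apart from the {{HAServiceProtocol}} calls, the 
 class and method names below are illustrative only.
{code:java}
import java.io.IOException;

import org.apache.hadoop.ha.HAServiceProtocol;
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;
import org.apache.hadoop.ha.HAServiceStatus;

public class RestartAwareHealthCheck {

  // What a connectivity-only check effectively verifies: the health RPC
  // succeeds. Once the watchdog restarts NN1, this returns true again,
  // so the brief crash window is invisible to the caller.
  static boolean rpcHealthy(HAServiceProtocol proxy) {
    try {
      proxy.monitorHealth();
      return true;
    } catch (IOException e) {
      return false;
    }
  }

  // Stricter check: if the node we last knew as ACTIVE now reports
  // STANDBY, it must have restarted (or been transitioned), so the
  // monitor should treat this as a state change and make a failover
  // decision instead of assuming the old role still holds.
  static boolean activeRoleStillHeld(HAServiceProtocol proxy,
                                     HAServiceState lastKnownState)
      throws IOException {
    if (lastKnownState != HAServiceState.ACTIVE) {
      return true; // we never believed it was active; nothing to lose
    }
    HAServiceStatus status = proxy.getServiceStatus();
    return status.getState() == HAServiceState.ACTIVE;
  }
}
{code}
 With the retry policy shown in the log (RetryUpToMaximumCountWithFixedSleep 
 with maxRetries=1, sleepTime=1 second), the reconnect lands after the restart 
 has completed, which is exactly why a connectivity-only check reports healthy 
 throughout.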



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6827) Both NameNodes stuck in STANDBY state due to HealthMonitor not aware of the target's status changing sometimes

2014-08-19 Thread Zesheng Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zesheng Wu updated HDFS-6827:
-

Summary: Both NameNodes stuck in STANDBY state due to HealthMonitor not 
aware of the target's status changing sometimes  (was: Both NameNodes could be 
in STANDBY State due to HealthMonitor not aware of the target's status changing 
sometimes)

--
This message was sent by Atlassian JIRA
(v6.2#6252)