[ 
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582802#comment-14582802
 ] 

lvchuanwen commented on HADOOP-10251:
-------------------------------------

############################hdfs-nn1-zkfc-host195.log################################

2015-06-12 02:25:38,799 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Failed on local exception: java.io.EOFException; 
Host Details : local host is: "zdh195/10.43.156.195"; destination host is: 
"zdh195":9000; 
2015-06-12 02:25:38,799 INFO org.apache.hadoop.ha.HealthMonitor: Entering state 
SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController: Local 
service NameNode at zdh195/10.43.156.195:9000 entered state: 
SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController: 
Quitting master election for NameNode at zdh195/10.43.156.195:9000 and marking 
that fencing is necessary
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x24d91acb5cb1c33 closed
2015-06-12 02:25:38,803 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x24d91acb5cb1c33
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-06-12 02:25:40,805 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:25:40,807 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed 
on connection exception: java.net.ConnectException: 拒绝连接; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
...
...
...
2015-06-12 02:27:22,976 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed 
on connection exception: java.net.ConnectException: 拒绝连接; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:24,978 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:24,979 WARN org.apache.hadoop.ha.HealthMonitor: 
Transport-level exception trying to monitor health of NameNode at 
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed 
on connection exception: java.net.ConnectException: 拒绝连接; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:26,981 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.HealthMonitor: Entering state 
SERVICE_HEALTHY
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.ZKFailoverController: Local 
service NameNode at zdh195/10.43.156.195:9000 entered state: SERVICE_HEALTHY
2015-06-12 02:27:27,811 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=zdh196:2181,zdh195:2181,zdh197:2181 
sessionTimeout=10000 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@651a6959
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server zdh195/10.43.156.195:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to zdh195/10.43.156.195:2181, initiating session
2015-06-12 02:27:27,814 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server zdh195/10.43.156.195:2181, sessionid = 
0x24d91acb5cb1c51, negotiated timeout = 10000
2015-06-12 02:27:27,815 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2015-06-12 02:27:27,816 INFO org.apache.hadoop.ha.ZKFailoverController: ZK 
Election indicated that NameNode at zdh195/10.43.156.195:9000 should become 
standby
2015-06-12 02:27:27,824 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to standby state
2015-06-12 02:27:48,662 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Checking for any old active which needs to be fenced...
2015-06-12 02:27:48,663 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old 
node exists: 0a0a636c757374657231393512036e6e321a067a646831393620a84628d33e
2015-06-12 02:27:48,665 INFO org.apache.hadoop.ha.ZKFailoverController: Should 
fence: NameNode at zdh196/10.43.156.196:9000
2015-06-12 02:27:49,668 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: zdh196/10.43.156.196:9000. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:49,670 WARN org.apache.hadoop.ha.FailoverController: Unable to 
gracefully make NameNode at zdh196/10.43.156.196:9000 standby (unable to 
connect)
java.net.ConnectException: Call From zdh195/10.43.156.195 to zdh196:9000 failed 
on connection exception: java.net.ConnectException: 拒绝连接; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.GeneratedConstructorAccessor24.newInstance(Unknown 
Source)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
        at org.apache.hadoop.ipc.Client.call(Client.java:1415)
        at org.apache.hadoop.ipc.Client.call(Client.java:1364)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
        at 
org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
        at 
org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
        at 
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:516)
        at 
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:507)
        at 
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
        at 
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:894)
        at 
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
        at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
        at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
Caused by: java.net.ConnectException: 拒绝连接
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
        at org.apache.hadoop.ipc.Client.call(Client.java:1382)
        ... 14 more
2015-06-12 02:27:49,671 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning 
Service Fencing Process... ======
2015-06-12 02:27:49,671 INFO org.apache.hadoop.ha.NodeFencer: Trying method 
1/2: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-06-12 02:27:49,674 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting 
to zdh196...
2015-06-12 02:27:49,674 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Connecting to zdh196 port 22
2015-06-12 02:27:49,676 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Connection established
2015-06-12 02:27:49,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Remote version string: SSH-2.0-OpenSSH_5.3
2015-06-12 02:27:49,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local 
version string: SSH-2.0-JSCH-0.1.42
2015-06-12 02:27:49,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
CheckCiphers: 
aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
aes256-ctr is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
aes192-ctr is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
aes256-cbc is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
aes192-cbc is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
arcfour256 is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_KEXINIT sent
2015-06-12 02:27:49,708 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_KEXINIT received
2015-06-12 02:27:49,708 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: 
server->client aes128-ctr hmac-md5 none
2015-06-12 02:27:49,708 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: 
client->server aes128-ctr hmac-md5 none
2015-06-12 02:27:49,712 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_KEXDH_INIT sent
2015-06-12 02:27:49,712 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
expecting SSH_MSG_KEXDH_REPLY
2015-06-12 02:27:49,719 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
ssh_rsa_verify: signature true
2015-06-12 02:27:49,720 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Permanently added 'zdh196' (RSA) to the list of known hosts.
2015-06-12 02:27:49,720 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_NEWKEYS sent
2015-06-12 02:27:49,720 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_NEWKEYS received
2015-06-12 02:27:49,721 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_SERVICE_REQUEST sent
2015-06-12 02:27:49,721 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
SSH_MSG_SERVICE_ACCEPT received
2015-06-12 02:27:49,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Authentications that can continue: 
gssapi-with-mic,publickey,keyboard-interactive,password
2015-06-12 02:27:49,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next 
authentication method: gssapi-with-mic
2015-06-12 02:27:49,727 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Authentications that can continue: publickey,keyboard-interactive,password
2015-06-12 02:27:49,727 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next 
authentication method: publickey
2015-06-12 02:27:49,771 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Authentication succeeded (publickey).
2015-06-12 02:27:49,772 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected 
to zdh196
2015-06-12 02:27:49,772 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking 
for process running on port 9000
2015-06-12 02:27:51,758 INFO org.apache.hadoop.ha.SshFenceByTcpPort: 
Indeterminate response from trying to kill service. Verifying whether it is 
running using nc...
2015-06-12 02:27:53,454 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Verified 
that the service is down.
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Disconnecting from zdh196 port 22
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing 
successful by method org.apache.hadoop.ha.SshFenceByTcpPort(null) ======
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing 
znode /hadoop-ha/cluster195/ActiveBreadCrumb to indicate that the local node is 
the most recent active...
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: 
Caught an exception, leaving main loop due to Socket closed

2015-06-12 02:27:53,458 INFO org.apache.hadoop.ha.ZKFailoverController: Trying 
to make NameNode at zdh195/10.43.156.195:9000 active...

2015-06-12 02:27:54,871 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to active state

2015-06-12 02:27:56,878 ERROR org.apache.hadoop.ha.ZKFailoverController: Local 
service NameNode at zdh195/10.43.156.195:9000 has changed the serviceState to 
active. Expected was standby. Quitting election marking fencing necessary.
2015-06-12 02:27:56,878 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
2015-06-12 02:27:56,882 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x24d91acb5cb1c51 closed
2015-06-12 02:27:56,882 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x24d91acb5cb1c51
2015-06-12 02:27:56,882 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2015-06-12 02:27:57,885 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=zdh196:2181,zdh195:2181,zdh197:2181 
sessionTimeout=10000 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@70f0acec
2015-06-12 02:27:57,886 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server zdh195/10.43.156.195:2181. Will not attempt to 
authenticate using SASL (unknown error)
2015-06-12 02:27:57,887 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to zdh195/10.43.156.195:2181, initiating session
2015-06-12 02:27:57,889 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server zdh195/10.43.156.195:2181, sessionid = 
0x24d91acb5cb1c52, negotiated timeout = 10000
2015-06-12 02:27:57,891 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2015-06-12 02:27:57,895 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Checking for any old active which needs to be fenced...
2015-06-12 02:27:57,896 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old 
node exists: 0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e
2015-06-12 02:27:57,896 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old 
node has our own data, so don't need to fence it.
2015-06-12 02:27:57,896 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing 
znode /hadoop-ha/cluster195/ActiveBreadCrumb to indicate that the local node is 
the most recent active...

2015-06-12 02:27:57,899 INFO org.apache.hadoop.ha.ZKFailoverController: Trying 
to make NameNode at zdh195/10.43.156.195:9000 active...
2015-06-12 02:27:57,901 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to active state



> Both NameNodes could be in STANDBY State if SNN network is unstable
> -------------------------------------------------------------------
>
>                 Key: HADOOP-10251
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10251
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.2.0
>            Reporter: Vinayakumar B
>            Assignee: Vinayakumar B
>            Priority: Critical
>             Fix For: 2.5.0
>
>         Attachments: HADOOP-10251.patch, HADOOP-10251.patch, 
> HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch
>
>
> Following corner scenario happened in one of our cluster.
> 1. NN1 was Active and NN2 was Standby
> 2. NN2 machine's network was slow 
> 3. NN1 got shutdown.
> 4. NN2 ZKFC got the notification and trying to check for old active for 
> fencing. (This took little more time, again due to slow network)
> 5. In between, NN1 got restarted by our automatic monitoring, and ZKFC made 
> it Active.
> 6. Now NN2 ZKFC got Old Active as NN1 and it did graceful fencing of NN1 to 
> STANBY.
> 7. Before writing ActiveBreadCrumb to ZK, NN2 ZKFC got session timeout and 
> got shutdown before making NN2 Active.
> *Now cluster having both NameNodes as STANDBY.*
> NN1 ZKFC still thinks that its nameNode is in Active state. 
> NN2 ZKFC waiting for election.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to