[
https://issues.apache.org/jira/browse/HADOOP-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582802#comment-14582802
]
lvchuanwen commented on HADOOP-10251:
-------------------------------------
############################hdfs-nn1-zkfc-host195.log################################
2015-06-12 02:25:38,799 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh195/10.43.156.195:9000: Failed on local exception: java.io.EOFException;
Host Details : local host is: "zdh195/10.43.156.195"; destination host is:
"zdh195":9000;
2015-06-12 02:25:38,799 INFO org.apache.hadoop.ha.HealthMonitor: Entering state
SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController: Local
service NameNode at zdh195/10.43.156.195:9000 entered state:
SERVICE_NOT_RESPONDING
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ZKFailoverController:
Quitting master election for NameNode at zdh195/10.43.156.195:9000 and marking
that fencing is necessary
2015-06-12 02:25:38,800 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ZooKeeper: Session:
0x24d91acb5cb1c33 closed
2015-06-12 02:25:38,803 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x24d91acb5cb1c33
2015-06-12 02:25:38,803 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
2015-06-12 02:25:40,805 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:25:40,807 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
...
...
...
2015-06-12 02:27:22,976 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:24,978 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:24,979 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
zdh195/10.43.156.195:9000: Call From zdh195/10.43.156.195 to zdh195:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
2015-06-12 02:27:26,981 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh195/10.43.156.195:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.HealthMonitor: Entering state
SERVICE_HEALTHY
2015-06-12 02:27:27,810 INFO org.apache.hadoop.ha.ZKFailoverController: Local
service NameNode at zdh195/10.43.156.195:9000 entered state: SERVICE_HEALTHY
2015-06-12 02:27:27,811 INFO org.apache.zookeeper.ZooKeeper: Initiating client
connection, connectString=zdh196:2181,zdh195:2181,zdh197:2181
sessionTimeout=10000
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@651a6959
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server zdh195/10.43.156.195:2181. Will not attempt to
authenticate using SASL (unknown error)
2015-06-12 02:27:27,812 INFO org.apache.zookeeper.ClientCnxn: Socket connection
established to zdh195/10.43.156.195:2181, initiating session
2015-06-12 02:27:27,814 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server zdh195/10.43.156.195:2181, sessionid =
0x24d91acb5cb1c51, negotiated timeout = 10000
2015-06-12 02:27:27,815 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session
connected.
2015-06-12 02:27:27,816 INFO org.apache.hadoop.ha.ZKFailoverController: ZK
Election indicated that NameNode at zdh195/10.43.156.195:9000 should become
standby
2015-06-12 02:27:27,824 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to standby state
2015-06-12 02:27:48,662 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2015-06-12 02:27:48,663 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists: 0a0a636c757374657231393512036e6e321a067a646831393620a84628d33e
2015-06-12 02:27:48,665 INFO org.apache.hadoop.ha.ZKFailoverController: Should
fence: NameNode at zdh196/10.43.156.196:9000
2015-06-12 02:27:49,668 INFO org.apache.hadoop.ipc.Client: Retrying connect to
server: zdh196/10.43.156.196:9000. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2015-06-12 02:27:49,670 WARN org.apache.hadoop.ha.FailoverController: Unable to
gracefully make NameNode at zdh196/10.43.156.196:9000 standby (unable to
connect)
java.net.ConnectException: Call From zdh195/10.43.156.195 to zdh196:9000 failed
on connection exception: java.net.ConnectException: 拒绝连接; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor24.newInstance(Unknown
Source)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy12.transitionToStandby(Unknown Source)
at
org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
at
org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
at
org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:516)
at
org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:507)
at
org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at
org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:894)
at
org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
at
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
Caused by: java.net.ConnectException: 拒绝连接
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 14 more
2015-06-12 02:27:49,671 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning
Service Fencing Process... ======
2015-06-12 02:27:49,671 INFO org.apache.hadoop.ha.NodeFencer: Trying method
1/2: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2015-06-12 02:27:49,674 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting
to zdh196...
2015-06-12 02:27:49,674 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connecting to zdh196 port 22
2015-06-12 02:27:49,676 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Connection established
2015-06-12 02:27:49,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Remote version string: SSH-2.0-OpenSSH_5.3
2015-06-12 02:27:49,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local
version string: SSH-2.0-JSCH-0.1.42
2015-06-12 02:27:49,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
CheckCiphers:
aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-ctr is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-ctr is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes256-cbc is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
aes192-cbc is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
arcfour256 is not available.
2015-06-12 02:27:49,707 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT sent
2015-06-12 02:27:49,708 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXINIT received
2015-06-12 02:27:49,708 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex:
server->client aes128-ctr hmac-md5 none
2015-06-12 02:27:49,708 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex:
client->server aes128-ctr hmac-md5 none
2015-06-12 02:27:49,712 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_KEXDH_INIT sent
2015-06-12 02:27:49,712 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
expecting SSH_MSG_KEXDH_REPLY
2015-06-12 02:27:49,719 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
ssh_rsa_verify: signature true
2015-06-12 02:27:49,720 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Permanently added 'zdh196' (RSA) to the list of known hosts.
2015-06-12 02:27:49,720 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS sent
2015-06-12 02:27:49,720 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_NEWKEYS received
2015-06-12 02:27:49,721 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_REQUEST sent
2015-06-12 02:27:49,721 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
SSH_MSG_SERVICE_ACCEPT received
2015-06-12 02:27:49,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue:
gssapi-with-mic,publickey,keyboard-interactive,password
2015-06-12 02:27:49,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next
authentication method: gssapi-with-mic
2015-06-12 02:27:49,727 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentications that can continue: publickey,keyboard-interactive,password
2015-06-12 02:27:49,727 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next
authentication method: publickey
2015-06-12 02:27:49,771 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Authentication succeeded (publickey).
2015-06-12 02:27:49,772 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected
to zdh196
2015-06-12 02:27:49,772 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking
for process running on port 9000
2015-06-12 02:27:51,758 INFO org.apache.hadoop.ha.SshFenceByTcpPort:
Indeterminate response from trying to kill service. Verifying whether it is
running using nc...
2015-06-12 02:27:53,454 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Verified
that the service is down.
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Disconnecting from zdh196 port 22
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing
successful by method org.apache.hadoop.ha.SshFenceByTcpPort(null) ======
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing
znode /hadoop-ha/cluster195/ActiveBreadCrumb to indicate that the local node is
the most recent active...
2015-06-12 02:27:53,455 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch:
Caught an exception, leaving main loop due to Socket closed
2015-06-12 02:27:53,458 INFO org.apache.hadoop.ha.ZKFailoverController: Trying
to make NameNode at zdh195/10.43.156.195:9000 active...
2015-06-12 02:27:54,871 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to active state
2015-06-12 02:27:56,878 ERROR org.apache.hadoop.ha.ZKFailoverController: Local
service NameNode at zdh195/10.43.156.195:9000 has changed the serviceState to
active. Expected was standby. Quitting election marking fencing necessary.
2015-06-12 02:27:56,878 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2015-06-12 02:27:56,882 INFO org.apache.zookeeper.ZooKeeper: Session:
0x24d91acb5cb1c51 closed
2015-06-12 02:27:56,882 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x24d91acb5cb1c51
2015-06-12 02:27:56,882 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
2015-06-12 02:27:57,885 INFO org.apache.zookeeper.ZooKeeper: Initiating client
connection, connectString=zdh196:2181,zdh195:2181,zdh197:2181
sessionTimeout=10000
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@70f0acec
2015-06-12 02:27:57,886 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server zdh195/10.43.156.195:2181. Will not attempt to
authenticate using SASL (unknown error)
2015-06-12 02:27:57,887 INFO org.apache.zookeeper.ClientCnxn: Socket connection
established to zdh195/10.43.156.195:2181, initiating session
2015-06-12 02:27:57,889 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server zdh195/10.43.156.195:2181, sessionid =
0x24d91acb5cb1c52, negotiated timeout = 10000
2015-06-12 02:27:57,891 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session
connected.
2015-06-12 02:27:57,895 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2015-06-12 02:27:57,896 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists: 0a0a636c757374657231393512036e6e311a067a646831393520a84628d33e
2015-06-12 02:27:57,896 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old
node has our own data, so don't need to fence it.
2015-06-12 02:27:57,896 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing
znode /hadoop-ha/cluster195/ActiveBreadCrumb to indicate that the local node is
the most recent active...
2015-06-12 02:27:57,899 INFO org.apache.hadoop.ha.ZKFailoverController: Trying
to make NameNode at zdh195/10.43.156.195:9000 active...
2015-06-12 02:27:57,901 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at zdh195/10.43.156.195:9000 to active state
> Both NameNodes could be in STANDBY State if SNN network is unstable
> -------------------------------------------------------------------
>
> Key: HADOOP-10251
> URL: https://issues.apache.org/jira/browse/HADOOP-10251
> Project: Hadoop Common
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.2.0
> Reporter: Vinayakumar B
> Assignee: Vinayakumar B
> Priority: Critical
> Fix For: 2.5.0
>
> Attachments: HADOOP-10251.patch, HADOOP-10251.patch,
> HADOOP-10251.patch, HADOOP-10251.patch, HADOOP-10251.patch
>
>
> Following corner scenario happened in one of our cluster.
> 1. NN1 was Active and NN2 was Standby
> 2. NN2 machine's network was slow
> 3. NN1 got shutdown.
> 4. NN2 ZKFC got the notification and trying to check for old active for
> fencing. (This took little more time, again due to slow network)
> 5. In between, NN1 got restarted by our automatic monitoring, and ZKFC made
> it Active.
> 6. Now NN2 ZKFC got Old Active as NN1 and it did graceful fencing of NN1 to
> STANBY.
> 7. Before writing ActiveBreadCrumb to ZK, NN2 ZKFC got session timeout and
> got shutdown before making NN2 Active.
> *Now cluster having both NameNodes as STANDBY.*
> NN1 ZKFC still thinks that its nameNode is in Active state.
> NN2 ZKFC waiting for election.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)