[ https://issues.apache.org/jira/browse/HDFS-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LAXMAN KUMAR SAHOO reassigned HDFS-7451:
----------------------------------------
Assignee: LAXMAN KUMAR SAHOO
> Namenode HA failover happens very frequently from active to standby
> -------------------------------------------------------------------
>
> Key: HDFS-7451
> URL: https://issues.apache.org/jira/browse/HDFS-7451
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: LAXMAN KUMAR SAHOO
> Assignee: LAXMAN KUMAR SAHOO
>
> We have two NameNodes with HA enabled. For the last couple of days we have
> been observing that failover from active to standby happens very frequently.
> Below are the logs from the active NameNode at the time the failover occurs.
> Is there any fix to get rid of this issue?
> Namenode logs:
> {code}
> 2014-11-25 22:24:02,020 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 10.2.16.214:40751: output error
> 2014-11-25 22:24:02,020 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020 caught an exception
> java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
>         at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2195)
>         at org.apache.hadoop.ipc.Server.access$2000(Server.java:110)
>         at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:979)
>         at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1045)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1798)
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /sda/dfs/namenode/current/edits_inprogress_0000000001643676954 -> /sda/dfs/namenode/current/edits_0000000001643676954-0000000001643677390
> 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Closing
> java.lang.Exception
>         at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.close(IPCLoggerChannel.java:182)
>         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.close(AsyncLoggerSet.java:102)
>         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.close(QuorumJournalManager.java:446)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.close(JournalSet.java:107)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:222)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:347)
>         at org.apache.hadoop.hdfs.server.namenode.JournalSet.close(JournalSet.java:219)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:308)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:939)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1365)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70)
>         at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.setState(ActiveState.java:52)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToStandby(NameNode.java:1278)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToStandby(NameNodeRpcServer.java:1046)
>         at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
>         at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3635)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
> 2014-11-25 22:24:10,632 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state
> 2014-11-25 22:24:10,633 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at dc1-had03-m002.dc01.revsci.net/10.2.16.92:8020 every 120 seconds.
> 2014-11-25 22:24:10,634 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread...
> Checkpointing active NN at dc1-had03-m002.dc01.revsci.net:50070
> Serving checkpoints at dc1-had03-m001.dc01.revsci.net/10.2.16.91:50070
> {code}
> zkfc logs:
> {code}
> 2014-11-25 22:24:12,192 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x449b8ce9a110255, likely server has closed socket, closing socket connection and attempting reconnect
> 2014-11-25 22:24:12,293 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
> 2014-11-25 22:24:12,950 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181. Will not attempt to authenticate using SASL (unknown error)
> 2014-11-25 22:24:12,951 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook06.dc01.revsci.net/10.2.16.205:2181, initiating session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x449b8ce9a110255 has expired, closing socket connection
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoining...
> 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
> 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=dc1-had03-zook01.dc01.revsci.net:2181,dc1-had03-zook02.dc01.revsci.net:2181,dc1-had03-zook03.dc01.revsci.net:2181,dc1-had03-zook04.dc01.revsci.net:2181,dc1-had03-zook05.dc01.revsci.net:2181,dc1-had03-zook06.dc01.revsci.net:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f042529
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181. Will not attempt to authenticate using SASL (unknown error)
> 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, initiating session
> 2014-11-25 22:24:13,172 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, sessionid = 0x149a7f9b6d60263, negotiated timeout = 5000
> 2014-11-25 22:24:13,173 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
> 2014-11-25 22:24:13,173 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x449b8ce9a110255
> 2014-11-25 22:24:13,173 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2014-11-25 22:24:13,389 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 should become standby
> 2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state
> {code}
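> In the zkfc log the ZooKeeper session expires with a negotiated timeout of
> only 5000 ms right before the transition, so one thing we are considering is
> raising the ZKFC session timeout. A minimal core-site.xml sketch, assuming the
> short session timeout is what lets the expiry happen (the value 30000 is only
> an example, not a tested recommendation):
> {code}
> <!-- core-site.xml on the NameNode/ZKFC hosts; restart the ZKFCs to pick this up -->
> <property>
>   <name>ha.zookeeper.session-timeout.ms</name>
>   <!-- currently 5000 ms per the log above; example value only, sized to ride out
>        GC pauses and brief ZooKeeper hiccups -->
>   <value>30000</value>
> </property>
> {code}
> A longer timeout would only mask the expiry; whatever is stalling the ZKFC or
> the ZooKeeper quorum (GC pauses, network drops, overloaded ZK nodes) would
> still need to be tracked down.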
> -laxman
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)