LAXMAN KUMAR SAHOO created HDFS-7451: ----------------------------------------
Summary: Namenode HA failover happens very frequently from active to standby Key: HDFS-7451 URL: https://issues.apache.org/jira/browse/HDFS-7451 Project: Hadoop HDFS Issue Type: Bug Reporter: LAXMAN KUMAR SAHOO We have two namenode having HA enabled. From last couple of days we are observing that the failover happens very frequently from active to standby mode. Below is the log details of the active namenode during failover happens. Is there any fix to get rid of this issue? Namenode logs: {code} 2014-11-25 22:24:02,020 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.hdfs.protocol.Clie ntProtocol.getListing from 10.2.16.214:40751: output error 2014-11-25 22:24:02,020 INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 8020 caught an exception java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474) at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2195) at org.apache.hadoop.ipc.Server.access$2000(Server.java:110) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:979) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1045) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1798) 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /sda/dfs/namenode/current/edits_inprogress_0000000001643676954 -> /sda/dfs/namenode/current/edits_0000000001643676954-0000000001643677390 2014-11-25 22:24:10,631 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Closing java.lang.Exception at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.close(IPCLoggerChannel.java:182) at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.close(AsyncLoggerSet.java:102) at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.close(QuorumJournalManager.java:446) at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.close(JournalSet.java:107) at org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:222) at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:347) at org.apache.hadoop.hdfs.server.namenode.JournalSet.close(JournalSet.java:219) at org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:308) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:939) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1365) at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70) at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:61) at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.setState(ActiveState.java:52) at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToStandby(NameNode.java:1278) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToStandby(NameNodeRpcServer.java:1046) at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119) at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3635) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) 2014-11-25 22:24:10,632 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for standby state 2014-11-25 22:24:10,633 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on active node at dc1-had03-m002.dc01.revsci.net/10.2.16.92:8020 every 120 seconds. 2014-11-25 22:24:10,634 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby checkpoint thread... Checkpointing active NN at dc1-had03-m002.dc01.revsci.net:50070 Serving checkpoints at dc1-had03-m001.dc01.revsci.net/10.2.16.91:50070 {code} zkfc logs: {code} 2014-11-25 22:24:12,192 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x449b8 ce9a110255, likely server has closed socket, closing socket connection and attempting reconnect 2014-11-25 22:24:12,293 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode... 2014-11-25 22:24:12,950 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook06.dc01.re vsci.net/10.2.16.205:2181. Will not attempt to authenticate using SASL (unknown error) 2014-11-25 22:24:12,951 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook06.dc01.revsc i.net/10.2.16.205:2181, initiating session 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x449b8ce9 a110255 has expired, closing socket connection 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session expired. Entering neutral mode and rejoini ng... 2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=dc1-had03-zook01. dc01.revsci.net:2181,dc1-had03-zook02.dc01.revsci.net:2181,dc1-had03-zook03.dc01.revsci.net:2181,dc1-had03-zook04.dc01.rev sci.net:2181,dc1-had03-zook05.dc01.revsci.net:2181,dc1-had03-zook06.dc01.revsci.net:2181 sessionTimeout=5000 watcher=org.a pache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f042529 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181. Will not attempt to authenticate using SASL (unknown error) 2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, initiating session 2014-11-25 22:24:13,172 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, sessionid = 0x149a7f9b6d60263, negotiated timeout = 5000 2014-11-25 22:24:13,173 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2014-11-25 22:24:13,173 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x449b8ce9a110255 2014-11-25 22:24:13,173 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2014-11-25 22:24:13,389 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 should become standby 2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state {code} -laxman -- This message was sent by Atlassian JIRA (v6.3.4#6332)