LAXMAN KUMAR SAHOO created HDFS-7451:
----------------------------------------

             Summary: Namenode HA failover happens very frequently from active 
to standby
                 Key: HDFS-7451
                 URL: https://issues.apache.org/jira/browse/HDFS-7451
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: LAXMAN KUMAR SAHOO


We have two namenode having HA enabled. From last couple of days we are 
observing that the failover happens very frequently from active to standby 
mode. Below is the log details of the active namenode during failover happens. 
Is there any fix to get rid of this issue?

Namenode logs:

{code}
2014-11-25 22:24:02,020 WARN org.apache.hadoop.ipc.Server: IPC Server 
Responder, call org.apache.hadoop.hdfs.protocol.Clie
ntProtocol.getListing from 10.2.16.214:40751: output error
2014-11-25 22:24:02,020 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
23 on 8020 caught an exception
java.nio.channels.ClosedChannelException
        at 
sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
        at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2195)
        at org.apache.hadoop.ipc.Server.access$2000(Server.java:110)
        at 
org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:979)
        at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1045)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1798)







2014-11-25 22:24:10,631 INFO 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits 
file /sda/dfs/namenode/current/edits_inprogress_0000000001643676954 -> 
/sda/dfs/namenode/current/edits_0000000001643676954-0000000001643677390
2014-11-25 22:24:10,631 INFO 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Closing
java.lang.Exception
        at 
org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.close(IPCLoggerChannel.java:182)
        at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.close(AsyncLoggerSet.java:102)
        at 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.close(QuorumJournalManager.java:446)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.close(JournalSet.java:107)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$4.apply(JournalSet.java:222)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:347)
        at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.close(JournalSet.java:219)
        at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.close(FSEditLog.java:308)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:939)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.stopActiveServices(NameNode.java:1365)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.exitState(ActiveState.java:70)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:61)
        at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.setState(ActiveState.java:52)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToStandby(NameNode.java:1278)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToStandby(NameNodeRpcServer.java:1046)
        at 
org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
        at 
org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3635)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)


2014-11-25 22:24:10,632 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required 
for standby state
2014-11-25 22:24:10,633 INFO 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on 
active node at dc1-had03-m002.dc01.revsci.net/10.2.16.92:8020 every 120 seconds.
2014-11-25 22:24:10,634 INFO 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting standby 
checkpoint thread...
Checkpointing active NN at dc1-had03-m002.dc01.revsci.net:50070
Serving checkpoints at dc1-had03-m001.dc01.revsci.net/10.2.16.91:50070
{code}

zkfc logs:
{code}
2014-11-25 22:24:12,192 INFO org.apache.zookeeper.ClientCnxn: Unable to read 
additional data from server sessionid 0x449b8
ce9a110255, likely server has closed socket, closing socket connection and 
attempting reconnect
2014-11-25 22:24:12,293 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
disconnected. Entering neutral mode...
2014-11-25 22:24:12,950 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server dc1-had03-zook06.dc01.re
vsci.net/10.2.16.205:2181. Will not attempt to authenticate using SASL (unknown 
error)
2014-11-25 22:24:12,951 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to dc1-had03-zook06.dc01.revsc
i.net/10.2.16.205:2181, initiating session
2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x449b8ce9
a110255 has expired, closing socket connection
2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
expired. Entering neutral mode and rejoini
ng...
2014-11-25 22:24:12,952 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying 
to re-establish ZK session
2014-11-25 22:24:12,952 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=dc1-had03-zook01.
dc01.revsci.net:2181,dc1-had03-zook02.dc01.revsci.net:2181,dc1-had03-zook03.dc01.revsci.net:2181,dc1-had03-zook04.dc01.rev
sci.net:2181,dc1-had03-zook05.dc01.revsci.net:2181,dc1-had03-zook06.dc01.revsci.net:2181
 sessionTimeout=5000 watcher=org.a
pache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@7f042529
2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181. Will 
not attempt to authenticate using SASL (unknown error)
2014-11-25 22:24:12,954 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, initiating 
session
2014-11-25 22:24:13,172 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server 
dc1-had03-zook01.dc01.revsci.net/10.2.16.200:2181, sessionid = 
0x149a7f9b6d60263, negotiated timeout = 5000
2014-11-25 22:24:13,173 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2014-11-25 22:24:13,173 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x449b8ce9a110255
2014-11-25 22:24:13,173 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2014-11-25 22:24:13,389 INFO org.apache.hadoop.ha.ZKFailoverController: ZK 
Election indicated that NameNode at 
dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 should become standby
2014-11-25 22:24:13,391 INFO org.apache.hadoop.ha.ZKFailoverController: 
Successfully transitioned NameNode at 
dc1-had03-m001.dc01.revsci.net/10.2.16.91:8020 to standby state
{code}

-laxman



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to