[ https://issues.apache.org/jira/browse/HADOOP-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wupeng updated HADOOP-16724:
----------------------------
    Summary: namenodes in our cluster often have their states changed from 
active to standby, the ERROR log is IPC's epoch 187 is less than the last 
promised epoch 188  (was: namenodes in our cluster often have their states 
changed from active to standby., the ERROR log is IPC's epoch 187 is less than 
the last promised epoch 188)

> namenodes in our cluster often have their states changed from active to 
> standby, the ERROR log is IPC's epoch 187 is less than the last promised 
> epoch 188
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16724
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16724
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 2.7.3
>            Reporter: wupeng
>            Priority: Major
>
> Recently, the NameNodes in our cluster have frequently had their state changed from active to standby.
> The relevant logs are below.
> Active NameNode:
> Serving checkpoints at http://zk14:50070
> 2019-11-24 08:43:09,765 WARN client.QuorumJournalManager 
> (IPCLoggerChannel.java:call(406)) - Took 17727ms to send a batch of 1 edits 
> (208 bytes) to remote journal 52.82.32.47.2:8485
> 2019-11-24 08:43:10,006 WARN client.QuorumJournalManager 
> (IPCLoggerChannel.java:call(388)) - Remote journal 52.82.32.47.2:8485 failed 
> to write txns 1942183681-1942183681. Will try to write to this JN again after 
> the next log roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 187 is less than the last promised epoch 188
>  at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
>  at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
>  at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
>  at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
>  at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>  at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
>  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1498)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1398)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>  at com.sun.proxy.$Proxy11.journal(Unknown Source)
>  at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
>  at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
>  at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-11-24 08:43:19,695 INFO namenode.TransferFsImage 
> (TransferFsImage.java:receiveFile(575)) - Combined time for fsimage download 
> and fsync to all disks took 9.35s. The fsimage download took 9.35s at 
> 156316.01 KB/s. Synchronous (fsync) write to disk of 
> /hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 0.00s. 
> Synchronous (fsync) write to disk of 
> /data10_4T/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 
> 0.00s.
> 2019-11-24 08:43:19,695 INFO namenode.TransferFsImage 
> (TransferFsImage.java:handleUploadImageRequest(135)) - Downloaded file 
> fsimage.ckpt_0000000001942183423 size 1496792686 bytes.
> 2019-11-24 08:43:19,807 INFO namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 
> 2 images with txid >= 1941733640
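> For reference, the "IPC's epoch X is less than the last promised epoch Y" error above is raised by the JournalNode's epoch-fencing check (Journal.checkRequest in the stack trace). A minimal sketch of that kind of check is shown below; the class, fields, and values are illustrative assumptions, not the actual Hadoop Journal code.
> {code:java}
> // Illustrative sketch of QJM-style epoch fencing on a JournalNode.
> // Class and field names are assumptions, not the real Journal implementation.
> import java.io.IOException;
>
> class EpochFencedJournal {
>     // Highest writer epoch this journal has promised to honor (188 in the log above).
>     private long lastPromisedEpoch = 188;
>
>     // Each writer RPC carries the caller's epoch; a stale writer (epoch 187 here)
>     // is rejected so that only the most recently promoted NameNode can write edits.
>     synchronized void checkWriteRequest(long ipcEpoch) throws IOException {
>         if (ipcEpoch < lastPromisedEpoch) {
>             throw new IOException("IPC's epoch " + ipcEpoch
>                 + " is less than the last promised epoch " + lastPromisedEpoch);
>         }
>     }
>
>     // Called when a newly promoted writer negotiates a higher epoch with the quorum.
>     synchronized void promiseEpoch(long newEpoch) {
>         lastPromisedEpoch = Math.max(lastPromisedEpoch, newEpoch);
>     }
> }
> {code}
> Read this way, the log is consistent with a failover: the standby NameNode was promoted and negotiated epoch 188 with the JournalNodes, after which the old active (still writing with epoch 187) was fenced off and later transitioned to standby.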
> Master ZKFC:
> 2019-11-24 08:43:04,352 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:onConnected(1279)) - Session establishment complete on 
> server zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c56e4, negotiated 
> timeout = 20000
> 2019-11-24 08:43:06,557 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(868)) - Exception handling the 
> winning of election
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hadoop-ha/ns/ActiveBreadCrumb
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>  at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>  at org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1025)
>  at org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1022)
>  at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1067)
>  at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1059)
>  at org.apache.hadoop.ha.ActiveStandbyElector.setDataWithRetries(ActiveStandbyElector.java:1022)
>  at org.apache.hadoop.ha.ActiveStandbyElector.writeBreadCrumbNode(ActiveStandbyElector.java:891)
>  at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:861)
>  at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:534)
>  at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> 2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:reJoinElection(723)) - Trying to re-establish ZK 
> session
> 2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK 
> connection for elector id=324457684 
> appData=0a026e7312036e6e321a047a6b313420d43e28d33e cb=Elector callbacks for 
> NameNode at zk14/52.82.32.47.14:8020
> 2019-11-24 08:43:06,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) 
> - Session: 0x269c073377c56e4 closed
> 2019-11-24 08:43:07,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) 
> - Initiating client connection, connectString=zk14:2181,zk2:2181,zk1:2181 
> sessionTimeout=20000 
> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@538e9048
> 2019-11-24 08:43:07,606 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server 
> zk14/52.82.32.47.14:2181. Will not attempt to authenticate using SASL 
> (unknown error)
> 2019-11-24 08:43:07,607 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:primeConnection(864)) - Socket connection established to 
> zk14/52.82.32.47.14:2181, initiating session
> 2019-11-24 08:43:07,631 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:onConnected(1279)) - Session establishment complete on 
> server zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c57c6, negotiated 
> timeout = 20000
> 2019-11-24 08:43:07,631 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
> 2019-11-24 08:43:07,631 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x269c073377c56e4
> 2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x269c073377c56e4
> 2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x269c073377c56e4
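> The BadVersionException above appears to come from the versioned ZooKeeper setData on /hadoop-ha/ns/ActiveBreadCrumb: the elector rewrites the breadcrumb using the znode version it last observed, so the write fails if another elector has rewritten the node in the meantime. A minimal sketch of that pattern follows; the connection string, session timeout, and payload are placeholders, not our production values.
> {code:java}
> // Illustrative versioned breadcrumb write, mirroring the pattern in the stack trace above.
> // Hostnames, timeout, and payload are placeholders.
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
>
> public class BreadCrumbWriteSketch {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk14:2181", 20000, event -> { });
>         byte[] appData = "nn-breadcrumb-data".getBytes("UTF-8");
>         int expectedVersion = 0;  // the version this elector last observed
>         try {
>             // Throws BadVersionException if another elector already rewrote the znode
>             // with a newer version, i.e. another ZKFC won the race to become active.
>             zk.setData("/hadoop-ha/ns/ActiveBreadCrumb", appData, expectedVersion);
>         } catch (KeeperException.BadVersionException e) {
>             System.err.println("Breadcrumb was updated by another elector: " + e.getMessage());
>         } finally {
>             zk.close();
>         }
>     }
> }
> {code}
> That matches what the master ZKFC does next in the log: it gives up on this attempt and re-establishes its ZooKeeper session, while the follower ZKFC (next section) proceeds to make zk2 active.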
> Follower ZKFC:
> 2019-11-24 08:43:06,577 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeStandby(484)) - ZK Election indicated that 
> NameNode at zk2/52.82.32.47.2:8020 should become standby
> 2019-11-24 08:43:06,579 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeStandby(489)) - Successfully transitioned 
> NameNode at zk2/52.82.32.47.2:8020 to standby state
> 2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(936)) - Checking for any old active 
> which needs to be fenced...
> 2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(957)) - Old node exists: 
> 0a026e7312036e6e321a047a6b313420d43e28d33e
> 2019-11-24 08:43:06,613 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:doFence(513)) - Should fence: NameNode at 
> zk14/52.82.32.47.14:8020
> 2019-11-24 08:43:07,159 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:doFence(519)) - Successfully transitioned NameNode 
> at zk14/52.82.32.47.14:8020 to standby state without fencing
> 2019-11-24 08:43:07,159 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:writeBreadCrumbNode(883)) - Writing znode 
> /hadoop-ha/ns/ActiveBreadCrumb to indicate that the local node is the most 
> recent active...
> 2019-11-24 08:43:07,212 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at 
> zk2/52.82.32.47.2:8020 active...
> 2019-11-24 08:43:11,918 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(387)) - Successfully transitioned 
> NameNode at zk2/52.82.32.47.2:8020 to active state
> 2019-11-24 09:00:32,482 WARN ha.HealthMonitor 
> (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying 
> to monitor health of NameNode at zk2/52.82.32.47.2:8020: java.io.IOException: 
> Connection reset by peer Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: "zk2/52.82.32.47.2"; 
> destination host is: "zk2":8020;
> 2019-11-24 09:00:32,482 INFO ha.HealthMonitor 
> (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING
> 2019-11-24 09:00:32,482 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:setLastHealthState(851)) - Local service NameNode 
> at zk2/52.82.32.47.2:8020 entered state: SERVICE_NOT_RESPONDING
> 2019-11-24 09:00:32,483 WARN tools.DFSZKFailoverController 
> (DFSZKFailoverController.java:getLocalNNThreadDump(244)) - Can't get local NN 
> thread dump due to Connection refused (Connection refused)
> 2019-11-24 09:00:32,483 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:recheckElectability(768)) - Quitting master 
> election for NameNode at zk2/52.82.32.47.2:8020 and marking that fencing is 
> necessary
> 2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:quitElection(406)) - Yielding from election
> 2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK 
> connection for elector id=324457684 
> appData=0a026e7312036e6e311a037a6b3220d43e28d33e cb=Elector callbacks for 
> NameNode at zk2/52.82.32.47.2:8020
> 2019-11-24 09:00:32,514 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) 
> - Session: 0x169c07336613aab closed
> 2019-11-24 09:00:32,581 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x169c07336613aab
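> For completeness, the timings above (a 17727 ms journal write and a negotiated ZooKeeper session timeout of 20000 ms) sit close to the standard Hadoop HA timeout settings. A minimal sketch of reading those settings with the Hadoop Configuration API is below; the fallback values passed to getInt are illustrative placeholders, not authoritative defaults.
> {code:java}
> // Illustrative lookup of the HA timeout settings referenced by the logs above.
> // Fallback values are placeholders, not the shipped defaults.
> import org.apache.hadoop.conf.Configuration;
>
> public class HaTimeoutCheck {
>     public static void main(String[] args) {
>         Configuration conf = new Configuration();
>         conf.addResource("hdfs-site.xml");  // assumes hdfs-site.xml is on the classpath
>
>         // How long the active NameNode waits for a JournalNode write before declaring it failed.
>         int journalWriteTimeoutMs = conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000);
>         // The ZKFC's ZooKeeper session timeout (negotiated as 20000 ms in the logs).
>         int zkSessionTimeoutMs = conf.getInt("ha.zookeeper.session-timeout.ms", 20000);
>         // Timeout for the ZKFC's health-check RPC to the local NameNode.
>         int healthRpcTimeoutMs = conf.getInt("ha.health-monitor.rpc-timeout.ms", 45000);
>
>         System.out.printf("journal write timeout: %d ms, ZK session timeout: %d ms, health RPC timeout: %d ms%n",
>             journalWriteTimeoutMs, zkSessionTimeoutMs, healthRpcTimeoutMs);
>     }
> }
> {code}
> A journal write that takes 17727 ms is close to those 20000 ms values, so these timeouts and the JournalNode/network latency behind them seem worth cross-checking when investigating the failovers.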


