[ https://issues.apache.org/jira/browse/HADOOP-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wupeng updated HADOOP-16724:
----------------------------
    Summary: namenodes in our cluster often have their states changed from 
active to standby, the ERROR log is IPC's epoch 187 is less than the last 
promised epoch 188  (was: namenodes in our cluster often have their states 
changed from active to standby., the ERROR log is IPC's epoch 187 is less than 
the last promised epoch 188)

> namenodes in our cluster often have their states changed from active to 
> standby, the ERROR log is IPC's epoch 187 is less than the last promised 
> epoch 188
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16724
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16724
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 2.7.3
>            Reporter: wupeng
>            Priority: Major
>
> Recently, the NameNodes in our cluster have frequently had their state changed from active to standby.
> The relevant logs are below.
> Active NameNode:
> Serving checkpoints at http://zk14:50070
> 2019-11-24 08:43:09,765 WARN client.QuorumJournalManager 
> (IPCLoggerChannel.java:call(406)) - Took 17727ms to send a batch of 1 edits 
> (208 bytes) to remote journal 52.82.32.47.2:8485
> 2019-11-24 08:43:10,006 WARN client.QuorumJournalManager 
> (IPCLoggerChannel.java:call(388)) - Remote journal 52.82.32.47.2:8485 failed 
> to write txns 1942183681-1942183681. Will try to write to this JN again after 
> the next log roll.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 187 is less than the last promised epoch 188
>  at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:428)
>  at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:456)
>  at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:351)
>  at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:152)
>  at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:158)
>  at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25421)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
>  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1554)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1498)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1398)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>  at com.sun.proxy.$Proxy11.journal(Unknown Source)
>  at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:167)
>  at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:385)
>  at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:378)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-11-24 08:43:19,695 INFO namenode.TransferFsImage 
> (TransferFsImage.java:receiveFile(575)) - Combined time for fsimage download 
> and fsync to all disks took 9.35s. The fsimage download took 9.35s at 
> 156316.01 KB/s. Synchronous (fsync) write to disk of 
> /hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 0.00s. 
> Synchronous (fsync) write to disk of 
> /data10_4T/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000001942183423 took 
> 0.00s.
> 2019-11-24 08:43:19,695 INFO namenode.TransferFsImage 
> (TransferFsImage.java:handleUploadImageRequest(135)) - Downloaded file 
> fsimage.ckpt_0000000001942183423 size 1496792686 bytes.
> 2019-11-24 08:43:19,807 INFO namenode.NNStorageRetentionManager 
> (NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 
> 2 images with txid >= 1941733640
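> For reference, the "IPC's epoch X is less than the last promised epoch Y" error above is raised by the JournalNode's epoch-fencing check (Journal.checkRequest in the stack trace). A minimal sketch of that kind of check is shown below; the class, fields, and values are illustrative assumptions, not the actual Hadoop Journal code.
> {code:java}
> // Illustrative sketch of QJM-style epoch fencing on a JournalNode.
> // Class and field names are assumptions, not the real Journal implementation.
> import java.io.IOException;
>
> class EpochFencedJournal {
>     // Highest writer epoch this journal has promised to honor (188 in the log above).
>     private long lastPromisedEpoch = 188;
>
>     // Each writer RPC carries the caller's epoch; a stale writer (epoch 187 here)
>     // is rejected so that only the most recently promoted NameNode can write edits.
>     synchronized void checkWriteRequest(long ipcEpoch) throws IOException {
>         if (ipcEpoch < lastPromisedEpoch) {
>             throw new IOException("IPC's epoch " + ipcEpoch
>                 + " is less than the last promised epoch " + lastPromisedEpoch);
>         }
>     }
>
>     // Called when a newly promoted writer negotiates a higher epoch with the quorum.
>     synchronized void promiseEpoch(long newEpoch) {
>         lastPromisedEpoch = Math.max(lastPromisedEpoch, newEpoch);
>     }
> }
> {code}
> Read this way, the log is consistent with a failover: the standby NameNode was promoted and negotiated epoch 188 with the JournalNodes, after which the old active (still writing with epoch 187) was fenced off and later transitioned to standby.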
> Master ZKFC:
> 2019-11-24 08:43:04,352 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:onConnected(1279)) - Session establishment complete on 
> server zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c56e4, negotiated 
> timeout = 20000
> 2019-11-24 08:43:06,557 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(868)) - Exception handling the 
> winning of election
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hadoop-ha/ns/ActiveBreadCrumb
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>  at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
>  at org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1025)
>  at org.apache.hadoop.ha.ActiveStandbyElector$5.run(ActiveStandbyElector.java:1022)
>  at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1067)
>  at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1059)
>  at org.apache.hadoop.ha.ActiveStandbyElector.setDataWithRetries(ActiveStandbyElector.java:1022)
>  at org.apache.hadoop.ha.ActiveStandbyElector.writeBreadCrumbNode(ActiveStandbyElector.java:891)
>  at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:861)
>  at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:534)
>  at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>  at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> 2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:reJoinElection(723)) - Trying to re-establish ZK 
> session
> 2019-11-24 08:43:06,557 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK 
> connection for elector id=324457684 
> appData=0a026e7312036e6e321a047a6b313420d43e28d33e cb=Elector callbacks for 
> NameNode at zk14/52.82.32.47.14:8020
> 2019-11-24 08:43:06,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) 
> - Session: 0x269c073377c56e4 closed
> 2019-11-24 08:43:07,605 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) 
> - Initiating client connection, connectString=zk14:2181,zk2:2181,zk1:2181 
> sessionTimeout=20000 
> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@538e9048
> 2019-11-24 08:43:07,606 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:logStartConnect(1019)) - Opening socket connection to server 
> zk14/52.82.32.47.14:2181. Will not attempt to authenticate using SASL 
> (unknown error)
> 2019-11-24 08:43:07,607 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:primeConnection(864)) - Socket connection established to 
> zk14/52.82.32.47.14:2181, initiating session
> 2019-11-24 08:43:07,631 INFO zookeeper.ClientCnxn 
> (ClientCnxn.java:onConnected(1279)) - Session establishment complete on 
> server zk14/52.82.32.47.14:2181, sessionid = 0x269c073377c57c6, negotiated 
> timeout = 20000
> 2019-11-24 08:43:07,631 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:processWatchEvent(600)) - Session connected.
> 2019-11-24 08:43:07,631 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x269c073377c56e4
> 2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x269c073377c56e4
> 2019-11-24 08:43:07,632 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x269c073377c56e4
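> The BadVersionException above appears to come from the versioned ZooKeeper setData on /hadoop-ha/ns/ActiveBreadCrumb: the elector rewrites the breadcrumb using the znode version it last observed, so the write fails if another elector has rewritten the node in the meantime. A minimal sketch of that pattern follows; the connection string, session timeout, and payload are placeholders, not our production values.
> {code:java}
> // Illustrative versioned breadcrumb write, mirroring the pattern in the stack trace above.
> // Hostnames, timeout, and payload are placeholders.
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
>
> public class BreadCrumbWriteSketch {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk14:2181", 20000, event -> { });
>         byte[] appData = "nn-breadcrumb-data".getBytes("UTF-8");
>         int expectedVersion = 0;  // the version this elector last observed
>         try {
>             // Throws BadVersionException if another elector already rewrote the znode
>             // with a newer version, i.e. another ZKFC won the race to become active.
>             zk.setData("/hadoop-ha/ns/ActiveBreadCrumb", appData, expectedVersion);
>         } catch (KeeperException.BadVersionException e) {
>             System.err.println("Breadcrumb was updated by another elector: " + e.getMessage());
>         } finally {
>             zk.close();
>         }
>     }
> }
> {code}
> That matches what the master ZKFC does next in the log: it gives up on this attempt and re-establishes its ZooKeeper session, while the follower ZKFC (next section) proceeds to make zk2 active.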
> Follower ZKFC:
> 2019-11-24 08:43:06,577 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeStandby(484)) - ZK Election indicated that 
> NameNode at zk2/52.82.32.47.2:8020 should become standby
> 2019-11-24 08:43:06,579 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeStandby(489)) - Successfully transitioned 
> NameNode at zk2/52.82.32.47.2:8020 to standby state
> 2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(936)) - Checking for any old active 
> which needs to be fenced...
> 2019-11-24 08:43:06,613 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:fenceOldActive(957)) - Old node exists: 
> 0a026e7312036e6e321a047a6b313420d43e28d33e
> 2019-11-24 08:43:06,613 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:doFence(513)) - Should fence: NameNode at 
> zk14/52.82.32.47.14:8020
> 2019-11-24 08:43:07,159 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:doFence(519)) - Successfully transitioned NameNode 
> at zk14/52.82.32.47.14:8020 to standby state without fencing
> 2019-11-24 08:43:07,159 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:writeBreadCrumbNode(883)) - Writing znode 
> /hadoop-ha/ns/ActiveBreadCrumb to indicate that the local node is the most 
> recent active...
> 2019-11-24 08:43:07,212 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(380)) - Trying to make NameNode at 
> zk2/52.82.32.47.2:8020 active...
> 2019-11-24 08:43:11,918 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:becomeActive(387)) - Successfully transitioned 
> NameNode at zk2/52.82.32.47.2:8020 to active state
> 2019-11-24 09:00:32,482 WARN ha.HealthMonitor 
> (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying 
> to monitor health of NameNode at zk2/52.82.32.47.2:8020: java.io.IOException: 
> Connection reset by peer Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: "zk2/52.82.32.47.2"; 
> destination host is: "zk2":8020;
> 2019-11-24 09:00:32,482 INFO ha.HealthMonitor 
> (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING
> 2019-11-24 09:00:32,482 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:setLastHealthState(851)) - Local service NameNode 
> at zk2/52.82.32.47.2:8020 entered state: SERVICE_NOT_RESPONDING
> 2019-11-24 09:00:32,483 WARN tools.DFSZKFailoverController 
> (DFSZKFailoverController.java:getLocalNNThreadDump(244)) - Can't get local NN 
> thread dump due to Connection refused (Connection refused)
> 2019-11-24 09:00:32,483 INFO ha.ZKFailoverController 
> (ZKFailoverController.java:recheckElectability(768)) - Quitting master 
> election for NameNode at zk2/52.82.32.47.2:8020 and marking that fencing is 
> necessary
> 2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:quitElection(406)) - Yielding from election
> 2019-11-24 09:00:32,483 INFO ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:terminateConnection(835)) - Terminating ZK 
> connection for elector id=324457684 
> appData=0a026e7312036e6e311a037a6b3220d43e28d33e cb=Elector callbacks for 
> NameNode at zk2/52.82.32.47.2:8020
> 2019-11-24 09:00:32,514 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) 
> - Session: 0x169c07336613aab closed
> 2019-11-24 09:00:32,581 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:isStaleClient(1093)) - Ignoring stale result from 
> old client with sessionId 0x169c07336613aab
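> For completeness, the timings above (a 17727 ms journal write and a negotiated ZooKeeper session timeout of 20000 ms) sit close to the standard Hadoop HA timeout settings. A minimal sketch of reading those settings with the Hadoop Configuration API is below; the fallback values passed to getInt are illustrative placeholders, not authoritative defaults.
> {code:java}
> // Illustrative lookup of the HA timeout settings referenced by the logs above.
> // Fallback values are placeholders, not the shipped defaults.
> import org.apache.hadoop.conf.Configuration;
>
> public class HaTimeoutCheck {
>     public static void main(String[] args) {
>         Configuration conf = new Configuration();
>         conf.addResource("hdfs-site.xml");  // assumes hdfs-site.xml is on the classpath
>
>         // How long the active NameNode waits for a JournalNode write before declaring it failed.
>         int journalWriteTimeoutMs = conf.getInt("dfs.qjournal.write-txns.timeout.ms", 20000);
>         // The ZKFC's ZooKeeper session timeout (negotiated as 20000 ms in the logs).
>         int zkSessionTimeoutMs = conf.getInt("ha.zookeeper.session-timeout.ms", 20000);
>         // Timeout for the ZKFC's health-check RPC to the local NameNode.
>         int healthRpcTimeoutMs = conf.getInt("ha.health-monitor.rpc-timeout.ms", 45000);
>
>         System.out.printf("journal write timeout: %d ms, ZK session timeout: %d ms, health RPC timeout: %d ms%n",
>             journalWriteTimeoutMs, zkSessionTimeoutMs, healthRpcTimeoutMs);
>     }
> }
> {code}
> A journal write that takes 17727 ms is close to those 20000 ms values, so these timeouts and the JournalNode/network latency behind them seem worth cross-checking when investigating the failovers.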


