[
https://issues.apache.org/jira/browse/HDFS-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508819#comment-14508819
]
XingFeng Shen commented on HDFS-8221:
-------------------------------------
Thank you, I will watch HDFS-8161.
> HDFS have two Standby NNs because ActiveStandbyElectorLock ephemeralOwner in
> ZK is different with the sessionId stored in ZKFC
> ------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-8221
> URL: https://issues.apache.org/jira/browse/HDFS-8221
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: auto-failover
> Affects Versions: 2.4.1
> Reporter: XingFeng Shen
>
> Initially, NN1 is active and NN2 is standby. When NN1 becomes standby for some
> reason, NN2 takes over the active state immediately. But after NN2 becomes
> active, it changes back to standby, and HDFS is left with two standby NNs
> indefinitely.
> After checking the log, I found that NN2 became standby because its session ID
> differs from the ActiveStandbyElectorLock ephemeralOwner stored in the znode.
> The root cause is that when NN1 goes to standby, NN2 creates session A with
> ZK and becomes active. Ideally, NN2's session ID should match the
> ActiveStandbyElectorLock ephemeralOwner stored in the znode, but a network
> problem can cause the session ID of NN2's ZKFC to change.
> So I think that when NN2 becomes standby because of a session ID mismatch, it
> should release the lock in the znode so that failover can happen again.
> ActiveStandbyElector.processResult
> ==================
> Code code = Code.get(rc);
> if (isSuccess(code)) {
>   // the following owner check completes verification in case the lock znode
>   // creation was retried
>   if (stat.getEphemeralOwner() == zkClient.getSessionId()) {
>     // we own the lock znode. so we are the leader
>     if (!becomeActive()) {
>       reJoinElectionAfterFailureToBecomeActive();
>     }
>   } else {
>     // we dont own the lock znode. so we are a standby.
>     becomeStandby();
>   }
>   // the watch set by us will notify about changes
>   return;
> }
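The owner check quoted above, and the change the reporter proposes, can be sketched without a ZooKeeper dependency. This is only an illustration under assumptions: `LockOwnerCheckSketch`, `Decision`, and the `previousSessionId` bookkeeping are hypothetical names that do not exist in Hadoop, the session IDs are made up, and the real `ActiveStandbyElector` may well need a different fix.

```java
import java.util.OptionalLong;

// Hypothetical, in-memory illustration; none of these names exist in Hadoop.
final class LockOwnerCheckSketch {

    enum Decision { BECOME_ACTIVE, BECOME_STANDBY, RELEASE_LOCK_AND_REJOIN }

    // Mirrors the quoted check: become active only if the lock znode's
    // ephemeralOwner equals the session ID the client currently holds.
    static Decision currentBehaviour(long ephemeralOwner, long currentSessionId) {
        return ephemeralOwner == currentSessionId
                ? Decision.BECOME_ACTIVE
                : Decision.BECOME_STANDBY;
    }

    // Assumed reading of the proposal: if the lock is still owned by a session
    // that this same ZKFC created earlier, release it and rejoin the election
    // instead of staying standby. Tracking previousSessionId is an assumption.
    static Decision proposedBehaviour(long ephemeralOwner,
                                      long currentSessionId,
                                      OptionalLong previousSessionId) {
        if (ephemeralOwner == currentSessionId) {
            return Decision.BECOME_ACTIVE;
        }
        if (previousSessionId.isPresent()
                && previousSessionId.getAsLong() == ephemeralOwner) {
            return Decision.RELEASE_LOCK_AND_REJOIN;
        }
        return Decision.BECOME_STANDBY;
    }

    public static void main(String[] args) {
        long sessionA = 0xAL; // made-up ID of the session that created the lock
        long sessionB = 0xBL; // made-up ID after the ZKFC session changed
        System.out.println(currentBehaviour(sessionA, sessionB));
        System.out.println(proposedBehaviour(sessionA, sessionB, OptionalLong.of(sessionA)));
    }
}
```

With the current check, a ZKFC whose session ID changed after it created the lock falls into the standby branch even though the lock was its own. The sketched variant separates that case from a lock held by a different node.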
> ActiveStandbyElectorLock content
> ==================
> [zk: 160.149.0.114:24002(CONNECTED) 1] get /hadoop-ha/hacluster/ActiveStandbyElectorLock
> 160-149-0-117 (remaining bytes are non-printable)
> cZxid = 0x2000a38d9
> ctime = Thu Apr 16 11:32:54 CST 2015
> mZxid = 0x2000a38d9
> mtime = Thu Apr 16 11:32:54 CST 2015
> pZxid = 0x2000a38d9
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x164cb2b3e4b36ae4
> dataLength = 38
> numChildren = 0
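To compare the `ephemeralOwner` printed above with the session ID a ZKFC reports, the hexadecimal value can be decoded into the same `long` representation that `Stat.getEphemeralOwner()` and `ZooKeeper.getSessionId()` return. The following is a minimal sketch; `EphemeralOwnerSketch`, `parseOwner`, and `ownsLock` are hypothetical helper names, not part of ZooKeeper or Hadoop.

```java
// Hypothetical helper for comparing a zkCli "ephemeralOwner" value with a
// session ID; not part of ZooKeeper or Hadoop.
final class EphemeralOwnerSketch {

    // Decodes a value such as "0x164cb2b3e4b36ae4" into a long.
    static long parseOwner(String hex) {
        String digits = (hex.startsWith("0x") || hex.startsWith("0X"))
                ? hex.substring(2)
                : hex;
        // parseUnsignedLong also accepts IDs whose top bit is set.
        return Long.parseUnsignedLong(digits, 16);
    }

    // True when the lock znode is owned by the given session.
    static boolean ownsLock(String ephemeralOwnerHex, long sessionId) {
        return parseOwner(ephemeralOwnerHex) == sessionId;
    }

    public static void main(String[] args) {
        String owner = "0x164cb2b3e4b36ae4"; // ephemeralOwner from the output above
        long sessionId = 0x164cb2b3e4b36ae4L; // stand-in for the ID taken from the ZKFC log
        System.out.println(ownsLock(owner, sessionId));
    }
}
```

If the two values differ, the ZKFC is in the situation described in this issue: the lock znode exists, but the elector's current session does not own it.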
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)