[jira] [Commented] (BOOKKEEPER-253) BKJM:Switch from standby to active fails and NN gets shut down due to delay in clearing of lock

Uma Maheswara Rao G (JIRA) Wed, 16 May 2012 03:57:31 -0700

    [ https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276658#comment-13276658 ]


Uma Maheswara Rao G commented on BOOKKEEPER-253:
------------------------------------------------

@Ivan, 

{quote}
There is one znode, the write permission znode, /journal/writeLock
When a node wants to start writing, it must read the znode to see what the 
current inprogress_znode is. At this point it saves the version of the 
writeLock znode. It then recovers the inprogress_znode, which will fence the 
ledger which it is using. It creates its own ledger, and then writes the new 
inprogress_znode to writeLock, using the version it previously saved.
If another node has tried to start writing before this, the version will have 
changed, so the write will fail. 
{quote}
I am not sure I followed you correctly.
This is what I understood:
When NN2 tries to become active while NN1 is still acting as active, it will 
get the current version id from ZK and do the ledger recoveries. Finally, it 
will check against its saved version id before proceeding to write.

In between, if NN1 is also recovering and creating a new ledger, it might have 
a different version id, and if NN2 has changed the version id by the time NN1 
finishes its recovery, NN1's write will fail.
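
To check my reading of the scheme, here is a minimal in-memory sketch in Java. 
It is not the BKJM code and it does not use the real ZooKeeper client: the 
class names (WriteLockSketch, VersionedZnode, Stat) and the inprogress names 
are hypothetical, and the ledger recovery/fencing step is only marked with a 
comment. It only shows the version check: both nodes save the same version, 
the first conditional write wins, and the second fails.

```java
/** In-memory illustration of a version-checked write to a znode such as /journal/writeLock. */
class WriteLockSketch {

    /** Immutable snapshot of the znode data together with its version (hypothetical type). */
    static final class Stat {
        final String data;
        final int version;

        Stat(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    /** Stand-in for a single versioned znode; not the ZooKeeper API. */
    static final class VersionedZnode {
        private Stat current;

        VersionedZnode(String initialData) {
            this.current = new Stat(initialData, 0);
        }

        /** Returns the current data and version, as a reader would save them. */
        synchronized Stat read() {
            return current;
        }

        /**
         * Writes only if the caller's saved version still matches.
         * Returns false when another writer has updated the znode in between.
         */
        synchronized boolean setData(String data, int expectedVersion) {
            if (current.version != expectedVersion) {
                return false;
            }
            current = new Stat(data, current.version + 1);
            return true;
        }
    }

    public static void main(String[] args) {
        VersionedZnode writeLock = new VersionedZnode("inprogress_1");

        // Both NNs read the znode and save its version before recovering.
        Stat savedByNn1 = writeLock.read();
        Stat savedByNn2 = writeLock.read();

        // (Recovery and fencing of the old inprogress ledger would happen here.)

        // NN2 finishes first and publishes its new inprogress znode name.
        boolean nn2Won = writeLock.setData("inprogress_nn2", savedByNn2.version);

        // NN1 finishes later; its saved version is now stale, so the write fails.
        boolean nn1Won = writeLock.setData("inprogress_nn1", savedByNn1.version);

        System.out.println("NN2 write succeeded: " + nn2Won);
        System.out.println("NN1 write succeeded: " + nn1Won);
        System.out.println("writeLock now holds: " + writeLock.read().data);
    }
}
```

With the real ZooKeeper client, the equivalent conditional write would be 
setData(path, data, version), which throws KeeperException.BadVersionException 
when the version has changed, rather than returning false.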
                
> BKJM:Switch from standby to active fails and NN gets shut down due to delay 
> in clearing of lock
> -----------------------------------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-253
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
>             Project: Bookkeeper
>          Issue Type: Bug
>          Components: bookkeeper-client
>            Reporter: suja s
>            Assignee: Uma Maheswara Rao G
>            Priority: Blocker
>
> Normal switch fails. 
> (The BKJournalManager ZK session timeout is 3000 and the ZKFC session timeout 
> is 5000. By the time control reaches lock acquisition, the previous lock has 
> not been released, so lock acquisition by the NN fails and the NN gets shut 
> down. Ideally it should have been done.)
> =============================================================================
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: 
> Failed to acquire lock with /ledgers/lock/lock-0000000007, lock-0000000006 
> already has it
> 2012-05-09 20:15:29,732 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
>  stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at 
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at 
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at 
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG: 
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
