[
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279474#comment-13279474
]
Uma Maheswara Rao G commented on BOOKKEEPER-253:
------------------------------------------------
@Ivan, I have updated a very basic patch. It does not include the tests; this
patch is just for checking the approach. I will include the tests in the next
version of the patch.
I could not find any reason for adding the permission lock in
recoverUnfinalizedSegments. Instead, we can perform this version check while
creating the ledger itself: in startLogSegment we set the permission data and
record the version number returned, and after creating the ledger we verify the
permission version by setting the data again with the previously saved version
number, as you proposed earlier.
We have verified this with both ZKFC and manual failover modes, and it is
working well. I am still trying to find any gaps. For now I have uploaded this
basic version of the patch; please provide your feedback on the approach.
Thanks
Uma
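The version-check fencing described above can be sketched as follows. This is a minimal, self-contained simulation, not the actual BKJM patch or the ZooKeeper API: the `VersionedNode` class and its names are hypothetical stand-ins for a znode, mirroring ZooKeeper's conditional setData(path, data, expectedVersion). The writer records the version returned when it first sets the permission data, and a later conditional set with that saved version fails if another writer has updated the node in between.

```java
public class FencingSketch {
    // Hypothetical stand-in for a ZooKeeper znode: data plus a version
    // counter that is bumped on every successful setData.
    static class VersionedNode {
        private byte[] data;
        private int version = 0;

        // Conditional set: succeeds and returns the new version only if
        // expectedVersion matches the current version (-1 means "any",
        // as in ZooKeeper's setData).
        synchronized int setData(byte[] newData, int expectedVersion) throws Exception {
            if (expectedVersion != -1 && expectedVersion != version) {
                throw new Exception("BadVersion: expected " + expectedVersion
                        + " but node is at " + version);
            }
            data = newData;
            return ++version;
        }
    }

    public static void main(String[] args) throws Exception {
        VersionedNode permissionNode = new VersionedNode();

        // startLogSegment: writer A sets the permission data and saves the
        // version number returned.
        int savedVersion = permissionNode.setData("writerA".getBytes(), -1);

        // After creating the ledger, writer A sets the data again with the
        // saved version; since nobody intervened, the check passes.
        savedVersion = permissionNode.setData("writerA".getBytes(), savedVersion);
        System.out.println("writer A check passed, version=" + savedVersion);

        // If another writer (e.g. a newly active NN) updates the node first,
        // writer A's conditional set fails, so writer A knows it was fenced.
        permissionNode.setData("writerB".getBytes(), -1);
        try {
            permissionNode.setData("writerA".getBytes(), savedVersion);
        } catch (Exception e) {
            System.out.println("writer A fenced: " + e.getMessage());
        }
    }
}
```

The point of the design is that no long-lived lock is held: the conditional write itself detects a competing writer, which is why the separate permission lock in recoverUnfinalizedSegments becomes unnecessary.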
> BKJM:Switch from standby to active fails and NN gets shut down due to delay
> in clearing of lock
> -----------------------------------------------------------------------------------------------
>
> Key: BOOKKEEPER-253
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
> Project: Bookkeeper
> Issue Type: Bug
> Components: bookkeeper-client
> Reporter: suja s
> Assignee: Uma Maheswara Rao G
> Priority: Blocker
> Attachments: BK-253-BKJM.patch
>
>
> A normal switchover fails.
> (The BKJournalManager ZK session timeout is 3000 ms and the ZKFC session
> timeout is 5000 ms. By the time the new active NN tries to acquire the lock,
> the previous lock has not yet been released, so lock acquisition by the NN
> fails and the NN gets shut down. Ideally the switchover should have
> succeeded.)
> =============================================================================
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock: Failed to acquire lock with /ledgers/lock/lock-0000000007, lock-0000000006 already has it
> 2012-05-09 20:15:29,732 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec, stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start the ZKFCs and NNs.
> NN1 is active and NN2 is standby.
> Stop NN1. NN2 tries to transition to active and gets shut down.
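For context, the timeout mismatch described in the issue corresponds to two separate configuration knobs. The property names below are assumptions based on the BKJM and ZKFC settings mentioned in the description (verify them against your Hadoop build); the values are the ones reported in this issue.

```xml
<!-- hdfs-site.xml: BKJM's own ZK session timeout (ms); the old NN's
     write lock is released only after this session expires -->
<property>
  <name>dfs.namenode.bookkeeperjournal.zk.session.timeout</name>
  <value>3000</value>
</property>

<!-- core-site.xml: ZKFC session timeout (ms); failover is triggered
     on this timescale -->
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>5000</value>
</property>
```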
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira