[
https://issues.apache.org/jira/browse/BOOKKEEPER-253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276452#comment-13276452
]
Uma Maheswara Rao G commented on BOOKKEEPER-253:
------------------------------------------------
Hi Ivan, Thanks a lot for taking a look.
{quote}
The NN1 should release the lock when it finalizes it's current segment.
FSEditLog#close calls endCurrentLogSegment which calls finalizeSegment on the
journalSet.
{quote}
My point is that, before NN1 finalizes the segment itself, NN2 can
become active and will get shut down, because the lock has not been released
by its peer node.
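To illustrate the point, a minimal sketch of one possible mitigation: instead of failing immediately when the peer's ephemeral lock znode still exists, the new active NN could retry until roughly the peer's ZK session timeout has elapsed, since ZooKeeper deletes the stale znode once the dead session expires. Everything here (class, method, and the `LockProbe` interface) is hypothetical and only models the timing logic, not the real `WriteLock`/ZooKeeper API.

```java
// Hypothetical sketch: retry lock acquisition for up to the peer's ZK
// session timeout, so a stale ephemeral lock left by a crashed NN can
// expire instead of forcing an immediate shutdown.
public class LockRetrySketch {

    /** Stand-in for a single non-blocking lock attempt. */
    interface LockProbe {
        boolean tryAcquire();
    }

    /** Returns true if the lock was acquired before the deadline. */
    static boolean acquireWithRetry(LockProbe probe, long sessionTimeoutMs,
                                    long pollIntervalMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + sessionTimeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (probe.tryAcquire()) {
                return true;              // stale znode expired; lock acquired
            }
            Thread.sleep(pollIntervalMs); // wait for ZK to expire the dead session
        }
        return false;                     // lock genuinely held by a live peer
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a crashed peer whose ZK session expires ~200 ms later.
        final long expiry = System.currentTimeMillis() + 200;
        boolean ok = acquireWithRetry(() -> System.currentTimeMillis() >= expiry,
                                      1000, 50);
        System.out.println(ok ? "acquired" : "failed");
    }
}
```

With this behavior, the scenario in this issue (BKJM timeout 3000 ms, ZKFC timeout 5000 ms) would resolve itself once the old session expires, rather than aborting the failover.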
{quote}
There was another JIRA, HDFS-3386 about a similar issue. Perhaps what you are
seeing is another manifestation of that.
{quote}
Yes, that is a similar but different issue. With our proposed fix, this should
also be addressed.
{quote}
Also, I agree with harmonising with the ZKFC. If the ZKFC is being used, we
should warn if the configured timeout is higher than ZKFC timeout. If it is not
configured, we should default to 90% of ZKFC timeout, or so.
{quote}
But the problem is that ZKFC and NN are different processes. It is not
necessarily true that the ZKFC configuration will also be available in the NN.
So, the only option here is to document it.
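For reference, the "default to 90% of the ZKFC timeout" idea can be sketched as below. The configuration key names and the `Map`-based config are hypothetical stand-ins (the real keys live in separate NN and ZKFC configuration files, which is exactly why this derivation is unreliable in practice); this only shows the intended precedence.

```java
import java.util.Map;

public class TimeoutDefaultSketch {
    // Hypothetical key names, for illustration only.
    static final String BKJM_KEY = "dfs.namenode.bkjournal.zk.session.timeout.ms";
    static final String ZKFC_KEY = "ha.zookeeper.session-timeout.ms";

    /**
     * Pick the BKJM ZK session timeout: an explicit value wins; otherwise
     * default to 90% of the ZKFC timeout if it happens to be visible in the
     * same configuration; otherwise fall back to a fixed default.
     */
    static int bkjmSessionTimeout(Map<String, Integer> conf, int fallbackMs) {
        Integer explicit = conf.get(BKJM_KEY);
        if (explicit != null) {
            return explicit;
        }
        Integer zkfc = conf.get(ZKFC_KEY);
        if (zkfc != null) {
            return (int) (zkfc * 0.9);   // stay safely under the ZKFC timeout
        }
        return fallbackMs;
    }

    public static void main(String[] args) {
        // ZKFC timeout visible: 90% of 5000 ms = 4500 ms.
        System.out.println(bkjmSessionTimeout(Map.of(ZKFC_KEY, 5000), 3000));
        // ZKFC timeout not visible: fall back to the fixed default.
        System.out.println(bkjmSessionTimeout(Map.of(), 3000));
    }
}
```

Since the ZKFC key is usually not visible to the NN process, documenting the constraint (BKJM session timeout should be below the ZKFC timeout) remains the practical approach.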
> BKJM:Switch from standby to active fails and NN gets shut down due to delay
> in clearing of lock
> -----------------------------------------------------------------------------------------------
>
> Key: BOOKKEEPER-253
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-253
> Project: Bookkeeper
> Issue Type: Bug
> Components: bookkeeper-client
> Reporter: suja s
> Assignee: Uma Maheswara Rao G
> Priority: Blocker
>
> Normal switch fails.
> (BKJournalManager ZK session timeout is 3000 ms and ZKFC session timeout is
> 5000 ms. By the time control comes to acquire the lock, the previous lock has
> not been released, which leads to a failure in lock acquisition by the NN,
> and the NN gets shut down. Ideally the switchover should have succeeded.)
> =============================================================================
> 2012-05-09 20:15:29,732 ERROR org.apache.hadoop.contrib.bkjournal.WriteLock:
> Failed to acquire lock with /ledgers/lock/lock-0000000007, lock-0000000006
> already has it
> 2012-05-09 20:15:29,732 FATAL
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error:
> recoverUnfinalizedSegments failed for required journal
> (JournalAndStream(mgr=org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager@412beeec,
> stream=null))
> java.io.IOException: Could not acquire lock
> at org.apache.hadoop.contrib.bkjournal.WriteLock.acquire(WriteLock.java:107)
> at
> org.apache.hadoop.contrib.bkjournal.BookKeeperJournalManager.recoverUnfinalizedSegments(BookKeeperJournalManager.java:406)
> at
> org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:551)
> at
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:322)
> at
> org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:548)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1134)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:598)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978)
> at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> at
> org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> 2012-05-09 20:15:29,736 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
> SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at HOST-XX-XX-XX-XX/XX.XX.XX.XX
> Scenario:
> Start ZKFCS, NNs
> NN1 is active and NN2 is standby
> Stop NN1. NN2 tries to transition to active and gets shut down
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira