Alex Rudyy created QPID-8576:
--------------------------------

             Summary: BDB HA Virtual Host start-up can hang when majority is 
lost during message store recovery operations
                 Key: QPID-8576
                 URL: https://issues.apache.org/jira/browse/QPID-8576
             Project: Qpid
          Issue Type: Bug
          Components: Broker-J
    Affects Versions: qpid-java-broker-8.0.6
            Reporter: Alex Rudyy


There is a risk of BDB HA Virtual Host start-up hanging when replica majority 
is lost during message store recovery operations. The stack trace like below is 
reported for the issue

{noformat}
2022-02-07 11:08:42,086 INFO  [VirtualHostNode-node1-Config] 
(q.m.h.quorum_lost) - [Broker] [grp(/myha)/vhn(/node1)] HA-1009 : Insufficient 
replicas contactable
2022-02-07 11:08:42,087 ERROR [VirtualHostNode-node1-Config] 
(o.a.q.s.m.AbstractConfiguredObject) - Failed to open object with name 'myha'.  
Object will be put into ERROR state.
org.apache.qpid.server.util.ConnectionScopedRuntimeException: Required number 
of nodes not reachable
        at 
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.handleDatabaseException(ReplicatedEnvironmentFacade.java:496)
        at 
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:333)
        at 
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.removeMessage(AbstractBDBMessageStore.java:354)
        at 
org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore$StoredBDBMessage.remove(AbstractBDBMessageStore.java:1205)
        at 
org.apache.qpid.server.virtualhost.SynchronousMessageStoreRecoverer.recover(SynchronousMessageStoreRecoverer.java:136)
        at 
org.apache.qpid.server.virtualhost.AbstractVirtualHost.postCreateDefaultExchangeTasks(AbstractVirtualHost.java:2811)
        at 
org.apache.qpid.server.virtualhost.AbstractVirtualHost.onActivate(AbstractVirtualHost.java:2757)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1526)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject.attainState(AbstractConfiguredObject.java:1505)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1072)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$8.onSuccess(AbstractConfiguredObject.java:1066)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$24$1.run(AbstractConfiguredObject.java:2894)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$24$1.run(AbstractConfiguredObject.java:2890)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$24.onSuccess(AbstractConfiguredObject.java:2889)
        at 
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1078)
        at 
org.apache.qpid.server.configuration.updater.TaskExecutorImpl$ImmediateIfSameThreadExecutor.execute(TaskExecutorImpl.java:400)
        at 
org.apache.qpid.server.configuration.updater.TaskExecutorImpl.execute(TaskExecutorImpl.java:183)
        at 
com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1164)
        at 
com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:708)
        at 
com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:113)
        at 
com.google.common.util.concurrent.Futures.addCallback(Futures.java:1045)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject.addFutureCallback(AbstractConfiguredObject.java:2884)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject.doAttainState(AbstractConfiguredObject.java:1065)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject.access$600(AbstractConfiguredObject.java:97)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:591)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$1.execute(AbstractConfiguredObject.java:578)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:639)
        at 
org.apache.qpid.server.model.AbstractConfiguredObject$2.execute(AbstractConfiguredObject.java:632)
        at 
org.apache.qpid.server.configuration.updater.TaskExecutorImpl$TaskLoggingWrapper.execute(TaskExecutorImpl.java:248)
        at 
org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper$1.run(TaskExecutorImpl.java:320)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at 
org.apache.qpid.server.configuration.updater.TaskExecutorImpl$CallableWrapper.call(TaskExecutorImpl.java:313)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.sleepycat.je.rep.InsufficientAcksException: (JE 7.4.5) 
Transaction: -4128977263  VLSN: 12,383,163,414, initiated at: 11:08:27.  
Insufficient acks for policy:SIMPLE_MAJORITY. Need replica acks: 3. Missing 
replica acks: 1. Timeout: 15000ms. FeederState=node1(4)[UNKNOWN]
No feeders.

        at 
com.sleepycat.je.rep.impl.node.DurabilityQuorum.ensureSufficientAcks(DurabilityQuorum.java:205)
        at 
com.sleepycat.je.rep.stream.FeederTxns.awaitReplicaAcks(FeederTxns.java:189)
        at 
com.sleepycat.je.rep.impl.RepImpl.postLogCommitHookInternal(RepImpl.java:1426)
        at 
com.sleepycat.je.rep.impl.RepImpl.postLogCommitHook(RepImpl.java:1385)
        at 
com.sleepycat.je.rep.txn.MasterTxn.postLogCommitHook(MasterTxn.java:228)
        at com.sleepycat.je.txn.Txn.commit(Txn.java:772)
        at com.sleepycat.je.Transaction.doCommit(Transaction.java:621)
        at com.sleepycat.je.Transaction.commit(Transaction.java:401)
        at 
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade.commit(ReplicatedEnvironmentFacade.java:329)
        ... 44 common frames omitted
2022-02-07 11:08:42,096 INFO  [UNKNOWN node1(4)] 
(o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - The node 'myha:node1' state is 
REPLICA
2022-02-07 11:08:43,704 INFO  [StateChange-myha:node1] (q.m.h.role_changed) - 
[Broker] [grp(/myha)] HA-1010 : Role change reported: Node : 'node1' 
(iaasn00348250:5000) : from 'WAITING' to 'MASTER'
2022-02-07 11:08:43,704 INFO  [StateChange-myha:node1] 
(o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - The environment facade is in open 
state for node myha:node1
2022-02-07 11:08:43,704 ERROR [Environment-myha:node1] 
(o.a.q.s.s.b.r.ReplicatedEnvironmentFacade) - Failed to restart environment.

{noformat}

The VH needs to handle a majority loss gracefully by restarting the environment 
in order to rejoin the cluster



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
For additional commands, e-mail: dev-h...@qpid.apache.org

Reply via email to