Keith Wall created QPID-6972:
--------------------------------
Summary: BDB HA: Node may remain detached from group following
loss of quorum
Key: QPID-6972
URL: https://issues.apache.org/jira/browse/QPID-6972
Project: Qpid
Issue Type: Bug
Components: Java Broker
Affects Versions: qpid-java-6.0, 0.32, 0.30
Reporter: Keith Wall
If a master detects that it has lost quorum (which may occur when a
user-generated transaction, or an internally generated 'ping' transaction, fails
to receive the required number of replica acknowledgements), the underlying JE
environment {{ReplicatedEnvironment}} is automatically restarted (the old one is
closed and a new one is created to replace it). This approach ensures that
clients reconnect to a new master promptly.
There is a coding error in the CoalescingCommitter that means that the JE
environment restart may not complete properly. If quorum disappears whilst
there are jobs on the CoalescingCommitter's job queue, the
CoalescingCommitter's error handling will cause the BDB EnvironmentFacade to be
closed. This is acceptable in the BDB non-HA case, where such an exception is
always fatal, but for HA, calling {{ReplicatedEnvironmentFacade#close()}}
prevents the environment from being recreated.
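The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch and not the Qpid source: {{HaFacadeModel}}, {{CommitterModel}}, {{rejoinsAfterQuorumLoss}} and the {{closeFacadeOnError}} flag are hypothetical names invented for illustration, and the "leave the facade open" branch is merely one conceivable alternative, not necessarily the fix that was applied.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the defect; none of these types are the real Qpid classes.
final class QuorumLossSketch {

    /** Stand-in for an HA environment facade that can be restarted unless closed. */
    static final class HaFacadeModel {
        private boolean closed;
        private int environmentRestarts;

        /** Permanent shutdown: after this, no new environment is created. */
        void close() {
            closed = true;
        }

        /** Models "close the old environment and create a new one". */
        boolean restartEnvironment() {
            if (closed) {
                return false; // restart cannot complete; node stays detached
            }
            environmentRestarts++;
            return true;
        }

        int environmentRestarts() {
            return environmentRestarts;
        }
    }

    /** Stand-in for a committer that drains a queue of commit jobs. */
    static final class CommitterModel {
        private final Queue<Runnable> jobs = new ArrayDeque<>();
        private final HaFacadeModel facade;
        private final boolean closeFacadeOnError; // assumption: toggles the buggy behaviour

        CommitterModel(HaFacadeModel facade, boolean closeFacadeOnError) {
            this.facade = facade;
            this.closeFacadeOnError = closeFacadeOnError;
        }

        void submit(Runnable job) {
            jobs.add(job);
        }

        void drain() {
            Runnable job;
            while ((job = jobs.poll()) != null) {
                try {
                    job.run();
                } catch (RuntimeException e) {
                    jobs.clear(); // remaining jobs cannot succeed either
                    if (closeFacadeOnError) {
                        facade.close(); // harmless for non-HA, blocks the HA restart
                    }
                }
            }
        }
    }

    /** Returns true if the environment restart completes after a failed commit job. */
    static boolean rejoinsAfterQuorumLoss(boolean closeFacadeOnError) {
        HaFacadeModel facade = new HaFacadeModel();
        CommitterModel committer = new CommitterModel(facade, closeFacadeOnError);
        committer.submit(() -> {
            throw new IllegalStateException("insufficient replica acknowledgements");
        });
        committer.drain();
        return facade.restartEnvironment();
    }

    public static void main(String[] args) {
        System.out.println("close on error -> rejoins: " + rejoinsAfterQuorumLoss(true));
        System.out.println("leave open     -> rejoins: " + rejoinsAfterQuorumLoss(false));
    }
}
```

In this model, {{rejoinsAfterQuorumLoss(true)}} returns false because the committer's error handler has already closed the facade by the time the restart is attempted, whereas {{rejoinsAfterQuorumLoss(false)}} returns true.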
The effect of this defect is that a node may disappear from the group every
time quorum is temporarily lost. This will keep occurring until too few nodes
remain to form a quorum, at which point the service will stop. Bouncing the
affected brokers (or restarting the virtual host nodes, VHNs) will restore the
service, without message loss.