[jira] [Updated] (QPID-6972) BDB HA: Node may remain detached from group following loss of quorum

Keith Wall (JIRA) Wed, 06 Jan 2016 09:54:18 -0800

     [ https://issues.apache.org/jira/browse/QPID-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Keith Wall updated QPID-6972:
-----------------------------
    Description: 
If a master detects that it has lost quorum (which may occur when a 
user-generated transaction, or an internally generated 'ping' transaction, 
fails to see the required number of replica acknowledgements), the underlying 
JE environment {{ReplicatedEnvironment}} is automatically restarted (the old 
one is closed and a new one created to replace it). This approach ensures that 
clients reconnect to a new master in a timely way.

A coding error in the CoalescingCommitter means that the JE environment 
restart may not complete properly. If quorum disappears whilst there are jobs 
on the CoalescingCommitter's job queue, the CoalescingCommitter's error 
handling causes the BDB EnvironmentFacade to be closed. This is acceptable in 
the BDB non-HA case, where such an exception is always fatal, but for HA, 
calling {{ReplicatedEnvironmentFacade#close()}} prevents the environment from 
being recreated.
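
The interaction described above can be modelled with a minimal sketch. All 
class and method names below ({{HaFacadeSketch}}, {{CommitterSketch}}, 
{{QuorumLossDemo}}) are hypothetical stand-ins invented for illustration; they 
are not the broker's real classes, and the non-closing variant is only one 
possible remedy, not necessarily the fix applied for this issue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for an HA facade: once close() has been called,
// a later attempt to recreate the environment is refused.
final class HaFacadeSketch {
    private boolean closed;
    private int generation = 1; // counts environment instances created so far

    void close() { closed = true; }

    // Models the automatic restart that follows loss of quorum.
    boolean restartEnvironment() {
        if (closed) {
            return false; // a closed facade never recreates its environment
        }
        generation++;
        return true;
    }

    int generation() { return generation; }
}

// Hypothetical stand-in for the committer and its error handling.
final class CommitterSketch {
    private final Queue<String> jobs = new ArrayDeque<>();
    private final HaFacadeSketch facade;
    private final boolean closeFacadeOnError;

    CommitterSketch(HaFacadeSketch facade, boolean closeFacadeOnError) {
        this.facade = facade;
        this.closeFacadeOnError = closeFacadeOnError;
    }

    void enqueue(String job) { jobs.add(job); }

    int pendingJobs() { return jobs.size(); }

    // Invoked when a commit fails while jobs are still queued.
    void onCommitFailure() {
        jobs.clear(); // pending jobs are abandoned in both variants
        if (closeFacadeOnError) {
            facade.close(); // defective for HA: blocks the later restart
        }
    }
}

final class QuorumLossDemo {
    // Returns true if the environment could be recreated after the failure.
    static boolean simulate(boolean closeFacadeOnError) {
        HaFacadeSketch facade = new HaFacadeSketch();
        CommitterSketch committer = new CommitterSketch(facade, closeFacadeOnError);
        committer.enqueue("tx-1");          // a job is pending when quorum is lost
        committer.onCommitFailure();        // committer error handling runs first
        return facade.restartEnvironment(); // then the restart is attempted
    }

    public static void main(String[] args) {
        System.out.println("closing handler, restart ok: " + simulate(true));
        System.out.println("non-closing handler, restart ok: " + simulate(false));
    }
}
```

In this model the restart only succeeds when the committer's error handling 
leaves the facade's lifecycle alone, which is the distinction between the 
non-HA and HA cases drawn above.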

The effect of this defect is that a node may disappear from the group every 
time quorum is temporarily lost. This will keep occurring until too few nodes 
remain to form a quorum, at which point the service will stop. Bouncing the 
affected brokers (or restarting the virtual host nodes, VHNs) will restore the 
service, without message loss.
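
To make the progression concrete, the following sketch counts how many 
transient quorum losses a group can absorb before an outage. It assumes a 
simple-majority quorum over a fixed group size and that each loss detaches 
exactly one node; both are assumptions made for illustration, and the class 
name {{QuorumMath}} is hypothetical.

```java
// Hypothetical helper: simple-majority arithmetic for a fixed-size group.
final class QuorumMath {
    // Smallest number of nodes that forms a simple majority.
    static int majority(int groupSize) {
        return groupSize / 2 + 1;
    }

    // Number of quorum-loss events after which the attached nodes can no
    // longer form a majority, assuming each event detaches one node.
    static int lossesUntilOutage(int groupSize) {
        int attached = groupSize;
        int losses = 0;
        while (attached >= majority(groupSize)) {
            attached--; // the affected node fails to rejoin the group
            losses++;
        }
        return losses;
    }

    public static void main(String[] args) {
        for (int size : new int[] {3, 5}) {
            System.out.println("group of " + size + ": outage after "
                    + lossesUntilOutage(size) + " quorum-loss events");
        }
    }
}
```

Under these assumptions a three-node group survives the first event with two 
attached nodes and loses service on the second.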



  

  was:
If a master detects that it has lost quorum (which may occur owing to a user 
generated transaction, or an internally generated 'ping' transaction, failing 
to see the required number of replica acknowledgements), the underlying JE 
environment {{ReplicatedEnvironment}} is automatically restarted (the old one 
closed and a new one created to replace it).   This approach ensures that 
clients reconnect to a new master in a timely way.

There is a coding error in the CoalescingCommitter that means that the JE 
environment restart may not complete properly.  If quorum disappears whilst 
there are jobs on the CoalescingCommitter's job queue, the  
CoalescingCommitter's error handling will cause the BDB EnvironmentFacade to be 
closed.   This is okay for the BDB non-HA case as such an exception is always 
fatal, but for HA, calling {{ReplicatedEnvironmentFacade#close()}} the prevents 
the environment from being recreated.

This effect of this defect is that a node may disappear from the group every 
time quorum is temporarily lost.  This will keep occuring until quorum no 
longer remains, at which point the business will stop.  Bouncing the affected 
brokers (or restarting the VHNs) will restore the service, without message loss.



  


> BDB HA: Node may remain detached from group following loss of quorum
> --------------------------------------------------------------------
>
>                 Key: QPID-6972
>                 URL: https://issues.apache.org/jira/browse/QPID-6972
>             Project: Qpid
>          Issue Type: Bug
>          Components: Java Broker
>    Affects Versions: 0.30, 0.32, qpid-java-6.0
>            Reporter: Keith Wall
>
> If a master detects that it has lost quorum (which may occur when a 
> user-generated transaction, or an internally generated 'ping' transaction, 
> fails to see the required number of replica acknowledgements), the underlying 
> JE environment {{ReplicatedEnvironment}} is automatically restarted (the old 
> one is closed and a new one created to replace it). This approach ensures 
> that clients reconnect to a new master in a timely way.
> A coding error in the CoalescingCommitter means that the JE environment 
> restart may not complete properly. If quorum disappears whilst there are jobs 
> on the CoalescingCommitter's job queue, the CoalescingCommitter's error 
> handling causes the BDB EnvironmentFacade to be closed. This is acceptable in 
> the BDB non-HA case, where such an exception is always fatal, but for HA, 
> calling {{ReplicatedEnvironmentFacade#close()}} prevents the environment from 
> being recreated.
> The effect of this defect is that a node may disappear from the group every 
> time quorum is temporarily lost. This will keep occurring until too few nodes 
> remain to form a quorum, at which point the service will stop. Bouncing the 
> affected brokers (or restarting the virtual host nodes, VHNs) will restore 
> the service, without message loss.
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

