-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20625/
-----------------------------------------------------------

Review request for qpid, Gordon Sim and Kenneth Giusti.


Bugs: QPID-5719
    https://issues.apache.org/jira/browse/QPID-5719


Repository: qpid


Description
-------

QPID-5719: HA becomes unresponsive once any of the brokers are SIGSTOPed

- Added a timeout to qpid-ha.
- The qpidd init script now pings the broker to verify it is not hung.
- Updated documentation in qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
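
For reviewers who want the idea behind the ping check in isolation, here is a
minimal Python sketch of a liveness probe with a timeout. It is not the code in
this patch: the function name broker_responds and its parameters are
hypothetical, and the command to run is supplied by the caller, so no qpid-ha
option names are assumed.

```python
import subprocess
import sys


def broker_responds(ping_cmd, timeout_s):
    """Return True if ping_cmd exits with status 0 within timeout_s seconds.

    ping_cmd is a hypothetical stand-in for whatever command the init
    script uses to ping the broker. A SIGSTOPed broker never answers, so
    the probe would hang; the timeout turns that hang into a failure.
    """
    try:
        result = subprocess.run(
            ping_cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is leaked.
        return False
    return result.returncode == 0
```

A probe that neither succeeds nor fails is the case that matters here, which is
why an expired timeout is reported the same way as a non-zero exit status.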

The new results for the cases mentioned in the bug:

a] stopped ALL brokers: rgmanager restarts the entire cluster but data is lost.
   This is equivalent to killing all the brokers at once. It does not affect
   quorum because only qpidd services are affected, not other services managed
   by cman.

b] stopped the primary: rgmanager restarts the primary after a timeout and
   promotes one of the backups.

c] stopped a backup: rgmanager restarts the backup after a timeout.
   Clients that are actively sending messages may see a delay while the backup
   is restarted.

Note that you need to set link-heartbeat-interval in qpidd.conf. The default is
very high (120 seconds); set it lower to see recovery from SIGSTOP in a
reasonable time.
See the updated documentation in
qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
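
As a quick way to check which value a given configuration would yield, the
following Python sketch reads link-heartbeat-interval from qpidd.conf-style
text. It is illustrative only and not part of this patch: the helper name
heartbeat_interval is hypothetical, and it assumes the file consists of
name=value lines with "#" comments and that the value is a whole number of
seconds. The fallback of 120 is the default mentioned above.

```python
def heartbeat_interval(conf_text, default=120):
    """Return link-heartbeat-interval in seconds from qpidd.conf-style text.

    Assumes name=value lines and '#' comments; falls back to the default
    (120 seconds, per the note above) when the option is not set.
    """
    value = default
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == "link-heartbeat-interval":
            # If the option appears more than once, the last one wins here.
            value = int(raw.strip())
    return value
```

For example, heartbeat_interval(open("/etc/qpid/qpidd.conf").read()) would
report the effective setting, where the path is an assumption about the local
installation.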


Diffs
-----

  /trunk/qpid/cpp/etc/qpidd-primary.in 1589403 
  /trunk/qpid/cpp/etc/qpidd.in 1589403 
  /trunk/qpid/cpp/src/tests/ha_test.py 1589403 
  /trunk/qpid/cpp/src/tests/ha_tests.py 1589403 
  /trunk/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml 1589403 
  /trunk/qpid/tools/src/py/qpid-ha 1589403 

Diff: https://reviews.apache.org/r/20625/diff/


Testing
-------

Tested with a 3-node cman cluster; passes full ctest.


Thanks,

Alan Conway
