-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20625/
-----------------------------------------------------------
(Updated April 24, 2014, 4:17 p.m.)
Review request for qpid, Gordon Sim and Kenneth Giusti.
Changes
-------
Removed a stray print "FIXME" left in by mistake (it never reached trunk, phew!)
Bugs: QPID-5719
https://issues.apache.org/jira/browse/QPID-5719
Repository: qpid
Description
-------
QPID-5719: HA becomes unresponsive once any of the brokers are SIGSTOPed
- Added a timeout to qpid-ha.
- The qpidd init script now pings the broker to verify it is not hung.
- Updated documentation in
qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
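The ping-with-timeout idea behind the first two items can be sketched roughly
as follows. This is a minimal illustration, not the code in the patch:
`broker_responds` is a hypothetical helper, and the command passed to it stands
in for whatever health check the init script actually runs.

```python
import subprocess
import sys


def broker_responds(command, timeout_seconds):
    """Return True if `command` exits with status 0 within the timeout.

    Hypothetical helper: `command` stands in for the real health check.
    A SIGSTOPed broker never answers, so the check has to give up on its own.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # No reply in time: treat the broker as hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in commands so the sketch runs without a broker installed.
    healthy = [sys.executable, "-c", "pass"]
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(broker_responds(healthy, timeout_seconds=5))
    print(broker_responds(hung, timeout_seconds=1))
```

Without the timeout, a check against a stopped process would block forever,
which is the hang described in the bug.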
The new results for the cases mentioned in the bug:
a] stopped ALL brokers: rgmanager restarts the entire cluster but data is lost.
Equivalent to killing all the brokers at once. This does not affect quorum
because only qpidd services are affected, not other services managed by cman.
b] stopped the primary: rgmanager restarts the primary after a timeout and
promotes one of the backups.
c] stopped a backup: rgmanager restarts the backup after a timeout.
Clients that are actively sending messages may see a delay while the backup is
restarted.
Note that you need to set link-heartbeat-interval in qpidd.conf. The default is
very high (120 seconds); set it lower to see recovery from SIGSTOP in a
reasonable time.
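A quick way to confirm which value is in effect is sketched below. It assumes
qpidd.conf uses plain `name=value` lines with `#` comments and that the interval
is a plain number of seconds; `heartbeat_interval` is a hypothetical helper, and
the value 5 in the sample is illustrative only, not a recommendation from this
change.

```python
def heartbeat_interval(conf_text, default=120.0):
    """Return link-heartbeat-interval in seconds from qpidd.conf-style text.

    Assumes `name=value` lines and `#` comments. Falls back to `default`
    (120 seconds, the default quoted above) when the option is not set.
    """
    value = default
    for raw_line in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, setting = line.partition("=")
        if key.strip() == "link-heartbeat-interval":
            value = float(setting.strip())
    return value


if __name__ == "__main__":
    sample = "# illustrative fragment\nlink-heartbeat-interval=5\n"
    print(heartbeat_interval(sample))
    print(heartbeat_interval("# option not set\n"))
```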
See the updated documentation in
qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
Diffs (updated)
-----
/trunk/qpid/cpp/etc/qpidd-primary.in 1589403
/trunk/qpid/cpp/etc/qpidd.in 1589403
/trunk/qpid/cpp/src/tests/ha_test.py 1589403
/trunk/qpid/cpp/src/tests/ha_tests.py 1589403
/trunk/qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml 1589403
/trunk/qpid/tools/src/py/qpid-ha 1589403
Diff: https://reviews.apache.org/r/20625/diff/
Testing
-------
Tested with a 3-node cman cluster; passes the full ctest run.
Thanks,
Alan Conway