[ https://issues.apache.org/jira/browse/QPID-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13980020#comment-13980020 ]
ASF subversion and git services commented on QPID-5719:
-------------------------------------------------------
Commit 1589807 from [~aconway] in branch 'qpid/trunk'
[ https://svn.apache.org/r1589807 ]
QPID-5719: HA becomes unresponsive once any of the brokers are SIGSTOPed
- Added timeout to qpid-ha.
- qpidd init script pings broker to verify it is not hung.
- Updated documentation in
qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
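
The timeout and ping changes listed above share one idea: a SIGSTOPed broker
never answers, so any probe has to give up after a bounded wait instead of
blocking. The following is a minimal Python sketch of that idea only, not the
code from this commit. The helper name broker_responds and the 5 second value
are hypothetical, and the probe command is whatever the caller passes in.

```python
import subprocess


def broker_responds(command, timeout_seconds):
    """Return True if `command` exits with status 0 within the timeout.

    A stopped broker never replies, so without a timeout the probe would
    block forever. The timeout turns that hang into an ordinary failure.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False  # probe hung: treat the broker as unresponsive
    except OSError:
        return False  # probe command could not be started
    return result.returncode == 0


# Hypothetical usage, assuming qpid-ha is on PATH:
#   healthy = broker_responds(["qpid-ha", "status"], timeout_seconds=5)
```

A wrapper like this lets a supervising script report failure to the cluster
manager rather than hanging along with the broker.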
The new results for the cases mentioned in the bug:
a] stopped ALL brokers: rgmanager restarts the entire cluster but data is lost.
Equivalent to killing all the brokers at once. This does not affect quorum
because only qpidd services are affected, not other services managed by cman.
b] stopped the primary: rgmanager restarts the primary after a timeout and
promotes one of the backups.
c] stopped a backup: rgmanager restarts the backup after a timeout.
Clients that are actively sending messages may see a delay while the backup is
restarted.
Note that you need to set link-heartbeat-interval in qpidd.conf. The default is
very high (120 seconds); set it lower to see recovery from SIGSTOP in a
reasonable time.
See the updated documentation in
qpid/doc/book/src/cpp-broker/Active-Passive-Cluster.xml.
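
As an illustration of the link-heartbeat-interval note, a qpidd.conf fragment
might look like the following. The value 5 is an assumption chosen only for
the example. Pick an interval that suits your network and recovery
requirements.

```
# qpidd.conf fragment (illustrative value, not a recommendation)
# Heartbeat interval in seconds. The default of 120 makes recovery from a
# SIGSTOPed broker slow.
link-heartbeat-interval=5
```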
> HA becomes unresponsive once any of the brokers are SIGSTOPed
> -------------------------------------------------------------
>
> Key: QPID-5719
> URL: https://issues.apache.org/jira/browse/QPID-5719
> Project: Qpid
> Issue Type: Bug
> Components: C++ Clustering
> Affects Versions: 0.28
> Reporter: Alan Conway
> Assignee: Alan Conway
> Attachments: ha-heartbeat.diff
>
>
> See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638
> Description of problem:
> qpid HA becomes unresponsive once any of the brokers are SIGSTOPed.
> There are three different cases:
> a] stopped ALL brokers
> b] stopped the primary
> c] stopped a backup
> In all of the cases listed above, the following observations were made:
> a-c] RHCS clustat reports that everything is fine
> a-c] qpid-ha (status --all) hangs
> a,b,c*] any other clients are blocked indefinitely
> a-b] cases directly at the beginning
> c] case at the end; the client is able to recover after a minute or so,
> due to connection timeout
> In fact this defect also proves that qpid-ha can be out of sync when compared
> to clustat as tracked by BZ.
> The expectations are:
> * a] quorum lost HA down (same as kill -9 to all nodes)
> no clients able to communicate
> * b] promotion of a new primary; there has to be a mechanism to get rid of
> the stopped process
> clients should be able to communicate after recovery
> * c] the unresponsive backup should get restarted
> clients should be able to communicate once the backup has been
> detected as unresponsive
> * Generally, better integration between the Qpid HA environment and RHCS
> is needed, i.e. SIGSTOP detection
> * A heartbeat between the primary and the backups is probably needed
--
This message was sent by Atlassian JIRA
(v6.2#6252)