Alan Conway created QPID-5719:
---------------------------------
Summary: HA becomes unresponsive once any of the brokers are SIGSTOPed
Key: QPID-5719
URL: https://issues.apache.org/jira/browse/QPID-5719
Project: Qpid
Issue Type: Bug
Components: C++ Clustering
Affects Versions: 0.28
Reporter: Alan Conway
Assignee: Alan Conway
See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638
Description of problem:
Qpid HA becomes unresponsive once any of the brokers is SIGSTOPed.
There are three different cases:
a] stopped ALL brokers
b] stopped the primary
c] stopped a backup
In each of the cases listed above, the following observations were made:
a-c] RHCS clustat looks fine and reports that everything is OK
a-c] qpid-ha (status --all) hangs
a,b,c*] all other clients are blocked indefinitely
a-b] clients are blocked right from the beginning
c*] clients are blocked at the end; a client is able to recover after a
minute or so, due to connection timeout
This defect also shows that qpid-ha can be out of sync with clustat, as
tracked by BZ.
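One possible reading of the observations above is that a SIGSTOPed process still exists and keeps its pipes and sockets open, so a check that only asks "does the process exist?" passes while anything waiting for a reply times out. The sketch below is a minimal, POSIX-only illustration in Python. The echoing child process is a hypothetical stand-in for a broker, and nothing here reflects how clustat or qpid-ha actually perform their checks.

```python
import os
import select
import signal
import subprocess
import sys

# Hypothetical stand-in for a broker: echoes every line it receives.
CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)


def process_exists(pid):
    """Existence check only: signal 0 tests for the pid without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def responds(proc, timeout):
    """Send one line and report whether an echo comes back within timeout."""
    proc.stdin.write(b"ping\n")
    proc.stdin.flush()
    ready, _, _ = select.select([proc.stdout], [], [], timeout)
    return bool(ready) and proc.stdout.readline() == b"ping\n"


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        before = responds(proc, 5.0)          # running: echo arrives
        os.kill(proc.pid, signal.SIGSTOP)     # freeze the child
        exists = process_exists(proc.pid)     # still passes while stopped
        during = responds(proc, 1.0)          # no echo: probe times out
        os.kill(proc.pid, signal.SIGCONT)     # resume the child
        after = responds(proc, 5.0)           # echo arrives again
    finally:
        proc.kill()
        proc.stdin.close()
        proc.stdout.close()
        proc.wait()
    return before, exists, during, after


if __name__ == "__main__":
    print(demo())
```

Only a probe that requires an actual reply within a deadline distinguishes the stopped process from a healthy one; the existence check cannot.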
The expectations are:
* a] quorum is lost and HA goes down (same as kill -9 on all nodes)
no clients are able to communicate
* b] a new primary is promoted; there has to be a mechanism to get rid of the
stopped process
clients should be able to communicate after recovery
* c] the unresponsive backup should get restarted
clients should be able to communicate once the backup has been
detected as unresponsive
* Generally, better integration between the Qpid HA environment and RHCS is
needed, i.e. SIGSTOP detection
* A heartbeat between the primary and the backups is probably needed
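As a rough illustration of the heartbeat idea in the last point, the following Python sketch records when each peer was last heard from and reports peers that have been silent for longer than a timeout. The class name, method names, and timeout value are assumptions made for illustration; they are not part of Qpid HA or RHCS.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per peer and flags peers that go silent."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = {}       # peer name -> time of last heartbeat

    def beat(self, peer):
        """Record a heartbeat received from peer."""
        self.last_seen[peer] = self.clock()

    def unresponsive(self):
        """Return the sorted names of peers silent for longer than timeout."""
        now = self.clock()
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A stopped broker sends no heartbeats, so it would show up in `unresponsive()` once the timeout expires, at which point it could be restarted or a new primary promoted.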