[
https://issues.apache.org/jira/browse/QPID-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Conway reopened QPID-5719:
-------------------------------
Frantisek Reznicek 2014-07-18 08:44:38 EDT
Behavior has improved greatly, but the fix is not complete yet.
Neither stopping the primary nor stopping a backup causes HA cluster
operations to hang any more.
A stopped primary is restarted, but a stopped backup stays stopped and does
not return to Qpid HA as a backup.
Testing details:
a] [PASS] stopped ALL brokers
The whole cluster is restarted and ends up in the joining state, which is
documented. We agreed that this scenario cannot be handled.
b] [PASS] stopped the primary
The primary's hang is detected, the primary is restarted as a backup, and
another ready backup is promoted to primary.
c] [FAIL] stopped a backup !!!
The backup stays in the stopped state indefinitely (tested for a couple of
minutes with link-heartbeat-interval=2). After that time, kill -CONT brings
the backup back into the cluster, which is also dangerous because the backup
can be completely out of sync.
d] [PASS] killed-9 ALL brokers
The whole cluster is restarted and ends up in the joining state, which is
documented. We agreed that this scenario cannot be handled.
e] [PASS] killed-9 the primary
The primary is restarted as a backup and another ready backup is promoted to
primary.
f] [PASS] killed-9 a backup
The backup is restarted as a backup.
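
The difference between the "stopped" and "killed-9" cases above can be
illustrated with a small sketch. It is not part of the Qpid test suite: the
child process is only a stand-in for a broker, and the function name
probe_signals is made up for this example. It assumes a POSIX system. The
point it tries to show is that a SIGSTOPed process still exists, so a plain
"is the process alive" check does not notice it, whereas a SIGKILLed process
is reaped with a clear exit status.

```python
import signal
import subprocess
import sys
import time


def probe_signals():
    """Compare what a plain liveness check sees after SIGSTOP and SIGKILL."""
    # Stand-in for a broker: a child that just sleeps (assumption, not qpidd).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    results = {}
    try:
        child.send_signal(signal.SIGSTOP)
        time.sleep(0.2)
        # poll() returns None while the process exists, stopped or running.
        results["alive_while_stopped"] = child.poll() is None

        child.send_signal(signal.SIGCONT)
        time.sleep(0.2)
        results["alive_after_cont"] = child.poll() is None

        child.send_signal(signal.SIGKILL)
        child.wait(timeout=5)
        # A negative return code is the number of the terminating signal.
        results["killed_returncode"] = child.returncode
    finally:
        # Clean up if anything above failed before the child was reaped.
        if child.poll() is None:
            child.kill()
            child.wait()
    return results


if __name__ == "__main__":
    print(probe_signals())
```

Under these assumptions, a supervisor that only checks for process existence
would treat the stopped child as healthy, which is consistent with the
backup in case c] never being restarted.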
> HA becomes unresponsive once any of the brokers are SIGSTOPed
> -------------------------------------------------------------
>
> Key: QPID-5719
> URL: https://issues.apache.org/jira/browse/QPID-5719
> Project: Qpid
> Issue Type: Bug
> Components: C++ Clustering
> Affects Versions: 0.28
> Reporter: Alan Conway
> Assignee: Alan Conway
> Fix For: 0.29
>
> Attachments: ha-heartbeat.diff
>
>
> See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638
> Description of problem:
> qpid HA becomes unresponsive once any of the brokers are SIGSTOPed.
> There are three different cases:
> a] stopped ALL brokers
> b] stopped the primary
> c] stopped a backup
> In each of the cases listed above, the following observations were made:
> a-c] RHCS clustat is fine and reports that everything is OK
> a-c] qpid-ha (status --all) hangs
> a,b,c*] any other clients are indefinitely blocked
>     a-b] cases: directly at the beginning
>     c] case: at the end; the client is able to recover after a minute or so,
>        due to the connection timeout
> In fact, this defect also shows that qpid-ha can be out of sync with
> clustat, as tracked by BZ.
> The expectations are:
> * a] quorum lost, HA down (same as kill -9 to all nodes);
>   no clients able to communicate
> * b] promotion of a new primary; there has to be a mechanism to get rid of
>   the stopped process;
>   clients should be able to communicate after recovery
> * c] the unresponsive backup should get restarted;
>   clients should be able to communicate once the backup has been
>   detected as unresponsive
> * Generally, better integration between the Qpid HA environment and RHCS
>   is needed, aka SIGSTOP detection
> * A heartbeat between primary and backups is probably needed
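
The heartbeat idea in the last expectation can be sketched as follows. This
is a minimal illustration only, not the approach taken in ha-heartbeat.diff
or in the Qpid HA code: the class HeartbeatMonitor, its method names, and the
rule "unresponsive after two missed intervals" are all assumptions made for
the example. The interval of 2 mirrors the link-heartbeat-interval=2 setting
used in testing.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer (hypothetical helper)."""

    def __init__(self, interval, missed_allowed=2, clock=time.monotonic):
        self.interval = interval
        # Assumption: a peer is unresponsive after this many silent intervals.
        self.timeout = interval * missed_allowed
        self.clock = clock
        self.last_seen = {}

    def beat(self, peer):
        """Record that a heartbeat from `peer` has just arrived."""
        self.last_seen[peer] = self.clock()

    def unresponsive(self):
        """Return the sorted names of peers silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    # Drive the monitor with a fake clock so the example runs instantly.
    now = [0.0]
    monitor = HeartbeatMonitor(interval=2, clock=lambda: now[0])
    monitor.beat("primary")
    monitor.beat("backup1")
    now[0] = 5.0
    monitor.beat("primary")        # backup1 stays silent, as if SIGSTOPed
    print(monitor.unresponsive())
```

A stopped broker sends no heartbeats, so a check like this would flag it even
though its process still exists. What to do with a flagged peer (restart it,
fence it, refuse its reconnection) is a separate decision that this sketch
does not cover.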