[jira] [Reopened] (QPID-5719) HA becomes unresponsive once any of the brokers are SIGSTOPed

Alan Conway (JIRA) Fri, 18 Jul 2014 09:28:38 -0700

     [ https://issues.apache.org/jira/browse/QPID-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alan Conway reopened QPID-5719:
-------------------------------


 Frantisek Reznicek 2014-07-18 08:44:38 EDT

There is a great improvement in behavior, but it is still not completely there.

Stopping neither the primary nor a backup causes the HA cluster to hang now.

Stopping the primary causes it to be restarted, but a stopped backup stays
stopped and does not rejoin qpid HA as a backup.

Testing details:

a] [PASS] stopped ALL brokers
   The whole cluster is restarted and ends up in the joining state, which is
   documented. We agreed that we cannot handle this scenario.
b] [PASS] stopped the primary
   The primary hang is detected, the primary is restarted as a backup, and
   another ready backup is promoted to primary.
c] [FAIL] stopped a backup !!!
   The backup stays in the stopped state forever (tested for a couple of
   minutes with link-heartbeat-interval=2). After that time, kill -CONT causes
   the backup to rejoin the cluster, which is also dangerous because it can be
   completely out of sync.
d] [PASS] killed-9 ALL brokers
   The whole cluster is restarted and ends up in the joining state, which is
   documented. We agreed that we cannot handle this scenario.
e] [PASS] killed-9 the primary
   The primary is restarted as a backup and another ready backup is promoted
   to primary.
f] [PASS] killed-9 a backup
   The backup is restarted as a backup.
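
The difference between the stopped and killed-9 cases above can be
illustrated with a minimal sketch. It is not part of the Qpid test suite: the
"broker" is a hypothetical stand-in (a sleeping child process), and the helper
names wait_until_stopped and stop_and_resume_demo are made up for this
example. It only shows that a SIGSTOPed process has not exited, so a plain
"is the process alive" check still passes while the process does no work.

```python
import os
import signal
import subprocess
import sys
import time


def wait_until_stopped(pid, timeout=5.0):
    """Return True once the child `pid` is reported as stopped."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # WUNTRACED reports stopped children; WNOHANG avoids blocking.
        done_pid, status = os.waitpid(pid, os.WUNTRACED | os.WNOHANG)
        if done_pid == pid and os.WIFSTOPPED(status):
            return True
        time.sleep(0.01)
    return False


def stop_and_resume_demo():
    """SIGSTOP a stand-in process and report (stopped, still_alive)."""
    # Hypothetical stand-in for a broker: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)
        stopped = wait_until_stopped(child.pid)
        # A stopped process has not exited, so a liveness check still passes.
        still_alive = child.poll() is None
        os.kill(child.pid, signal.SIGCONT)
        return stopped, still_alive
    finally:
        # Clean up the stand-in process regardless of the outcome.
        child.kill()
        child.wait()
```

On a POSIX system both returned values are expected to be True, which is why
a supervisor that only checks for process exit would not notice case c].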




> HA becomes unresponsive once any of the brokers are SIGSTOPed
> -------------------------------------------------------------
>
>                 Key: QPID-5719
>                 URL: https://issues.apache.org/jira/browse/QPID-5719
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.28
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.29
>
>         Attachments: ha-heartbeat.diff
>
>
> See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638
> Description of problem:
> qpid HA becomes unresponsive once any of the brokers are SIGSTOPed.
> There are three different cases:
> a] stopped ALL brokers
> b] stopped the primary
> c] stopped a backup
> In each of the cases listed above, the following observations were made:
> a-c]    RHCS clustat is fine and reports that everything is OK
> a-c]    qpid-ha (status --all) hangs
> a,b,c*] any other clients are blocked indefinitely
>         a-b] right from the beginning
>         c] at the end; the client is able to recover after a minute or so,
>            due to the connection timeout
> In fact, this defect also proves that qpid-ha can be out of sync with
> clustat, as tracked by the BZ.
> The expectations are:
>  * a] quorum lost, HA down (same as kill -9 on all nodes)
>       no clients able to communicate
>  * b] promotion of a new primary; there has to be a mechanism to get rid of
>       the stopped process
>       clients should be able to communicate after recovery
>  * c] the unresponsive backup should get restarted
>       clients should be able to communicate once the backup has been
>       detected as unresponsive
>  * Generally, better integration between the Qpid HA environment and RHCS
>    is needed, aka SIGSTOP detection
>  * A heartbeat between the primary and the backups is probably needed
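
The heartbeat expectation above can be sketched as a simple timeout check.
This is an illustration only, not the contents of ha-heartbeat.diff and not
Qpid's implementation: the class HeartbeatMonitor, its missed_limit
parameter, and the rule "unresponsive after missed_limit silent intervals"
are all assumptions made for the example. The interval of 2 mirrors the
link-heartbeat-interval=2 setting mentioned in the test notes.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per peer and flag silent peers.

    Hypothetical sketch: a peer counts as unresponsive when no heartbeat
    has been recorded for more than interval * missed_limit seconds.
    """

    def __init__(self, interval, missed_limit=2):
        self.timeout = interval * missed_limit
        self.last_seen = {}

    def record(self, peer, now):
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def unresponsive(self, now):
        """Return the sorted peers whose last heartbeat is too old."""
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Usage: two backups start out healthy, then backup2 goes silent
# (for example because it was SIGSTOPed).
monitor = HeartbeatMonitor(interval=2)
monitor.record("backup1", now=0)
monitor.record("backup2", now=0)
monitor.record("backup1", now=3)
silent = monitor.unresponsive(now=5)  # only backup2 exceeded the timeout
```

Passing the time in explicitly keeps the check independent of the wall clock;
a stopped process sends no heartbeats, so it is flagged even though it has
not exited.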



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
