Alan Conway created QPID-5719:
---------------------------------

             Summary: HA becomes unresponsive once any of the brokers are SIGSTOPed
                 Key: QPID-5719
                 URL: https://issues.apache.org/jira/browse/QPID-5719
             Project: Qpid
          Issue Type: Bug
          Components: C++ Clustering
    Affects Versions: 0.28
            Reporter: Alan Conway
            Assignee: Alan Conway


See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638

Description of problem:

Qpid HA becomes unresponsive once any of the brokers is SIGSTOPed.

There are three different cases:
a] stopped ALL brokers
b] stopped the primary
c] stopped a backup

In each of the cases listed above, the following observations were made:

a-c]    RHCS clustat reports that everything is OK
a-c]    qpid-ha (status --all) hangs
a,b,c*] all other clients are blocked indefinitely
        a-b] blocked right from the beginning
        c]   blocked at the end; the client is able to recover after a minute
             or so, due to connection timeout
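
Because qpid-ha itself hangs in all three cases, a reproduction script needs a deadline around the status call. The following is only a minimal sketch in Python: the helper name run_with_deadline and the timeout values are assumptions made for illustration and are not part of Qpid or RHCS.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return (returncode, stdout), or None if it did not finish in time."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return None
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Stand-in for a hung command: a child that sleeps longer than the deadline.
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_deadline(hung, timeout_s=0.5))  # None
```

On a test node the same helper could wrap the command from the observations above, for example run_with_deadline(["qpid-ha", "status", "--all"], timeout_s=10), and treat a None result as "hung".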

This defect also shows that the state reported by qpid-ha can be out of sync
with the state reported by clustat, as tracked by the BZ.

The expectations are:
 * a] quorum is lost and HA goes down (same as kill -9 on all nodes);
      no clients are able to communicate
 * b] a new primary is promoted; there has to be a mechanism to get rid of
      the stopped process;
      clients should be able to communicate after recovery
 * c] the unresponsive backup should get restarted;
      clients should be able to communicate once the backup has been
      detected as unresponsive

 * Generally, better integration between the Qpid HA environment and RHCS is
   needed, i.e. SIGSTOP detection
 * A heartbeat between the primary and the backups is probably needed
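
To illustrate the heartbeat idea: a SIGSTOPed process keeps its connections open but stops sending anything, so a peer that tracks the time of the last heartbeat can flag it after a timeout. The sketch below is a hypothetical Python illustration only. The class HeartbeatMonitor, its methods, and the broker names are assumptions and do not reflect how Qpid HA or RHCS is implemented.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per broker and report the ones that went silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_seen = {}     # broker name -> time of last heartbeat

    def beat(self, broker):
        """Record a heartbeat received from broker."""
        self.last_seen[broker] = self.clock()

    def unresponsive(self):
        """Return the brokers whose last heartbeat is older than timeout_s."""
        now = self.clock()
        return sorted(
            broker
            for broker, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A stopped primary or backup would appear in unresponsive() once timeout_s has elapsed, at which point the cluster manager could fence or restart it, as in expectations b] and c].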



