[jira] [Updated] (QPID-4082) cluster de-sync after broker restart & queue replication

Pavel Moravec (JIRA) Thu, 21 Jun 2012 00:02:49 -0700

     [ 
https://issues.apache.org/jira/browse/QPID-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Pavel Moravec updated QPID-4082:
--------------------------------

    Attachment: QPID-4082.patch

Patch proposal. The bug is caused by "deliveryCount" in 
SemanticState::ConsumerImpl (qpid/broker/SemanticState.cpp) not being 
replicated to a joining cluster node during catch-up. When the elder broker in 
src.cluster sends session.sync() after sending 5 messages (per --ack 5 in 
qpid-route), the recently joined node in src.cluster does not do so, which 
leads to the cluster de-sync.
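
As an illustration only, the divergence can be modelled with a small 
stand-alone sketch. ConsumerModel and desyncsOnNextDelivery are hypothetical 
names, not Qpid classes, and the "reset after sync" behaviour is an assumption 
about how the counter is used:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-consumer state discussed above.
// It is NOT the real SemanticState::ConsumerImpl.
struct ConsumerModel {
    std::uint32_t ackFrequency;   // e.g. 5, as with "--ack 5" in qpid-route
    std::uint32_t deliveryCount;  // deliveries since the last sync request

    explicit ConsumerModel(std::uint32_t ack, std::uint32_t count = 0)
        : ackFrequency(ack), deliveryCount(count) {
        assert(ack > 0);
    }

    // Count one delivery; return true if it should be followed by a sync.
    // Assumption: the counter is reset once the sync is requested.
    bool deliver() {
        if (++deliveryCount >= ackFrequency) {
            deliveryCount = 0;
            return true;
        }
        return false;
    }
};

// True if the two members disagree on whether the next delivery triggers
// a sync, i.e. the point at which the cluster members diverge.
// Both arguments are taken by value so the callers' state is untouched.
bool desyncsOnNextDelivery(ConsumerModel elder, ConsumerModel joiner) {
    return elder.deliver() != joiner.deliver();
}
```

With an elder that has already counted 4 deliveries and a joiner that starts 
from 0, desyncsOnNextDelivery() returns true; if the joiner starts from the 
same count of 4, it returns false.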

The patch:
- adds a new property, deliveryCount, to the "consumer-state" method (see the 
xml/cluster.xml file change) so that it is passed to a new joiner
- updates cluster::Connection::consumerState to send deliveryCount to the method
- updates cluster::Connection::consumerState to set the received deliveryCount
- adds two methods to broker::SemanticState::ConsumerImpl for getting and 
setting deliveryCount
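
A rough sketch of the shape of these changes, not the patch itself (see the 
attached QPID-4082.patch for the real code). ConsumerSketch, 
ConsumerStateUpdate, makeUpdate and applyUpdate are hypothetical stand-ins, 
and the accessor names are assumed:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for broker::SemanticState::ConsumerImpl.
class ConsumerSketch {
    std::uint32_t deliveryCount;
  public:
    ConsumerSketch() : deliveryCount(0) {}

    void noteDelivery() { ++deliveryCount; }

    // The accessor pair; the names here are assumptions.
    std::uint32_t getDeliveryCount() const { return deliveryCount; }
    void setDeliveryCount(std::uint32_t n) { deliveryCount = n; }
};

// Hypothetical stand-in for the extended "consumer-state" method, carrying
// only the newly added field.
struct ConsumerStateUpdate {
    std::uint32_t deliveryCount;
};

// Sending side: read the counter from the existing member's consumer.
ConsumerStateUpdate makeUpdate(const ConsumerSketch& existing) {
    ConsumerStateUpdate update;
    update.deliveryCount = existing.getDeliveryCount();
    return update;
}

// Receiving side: set the received counter on the joiner's consumer.
void applyUpdate(ConsumerSketch& joiner, const ConsumerStateUpdate& update) {
    joiner.setDeliveryCount(update.deliveryCount);
    assert(joiner.getDeliveryCount() == update.deliveryCount);
}
```

After applyUpdate(joiner, makeUpdate(elder)), both consumers hold the same 
deliveryCount, so they reach the sync point on the same delivery.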

                
> cluster de-sync after broker restart & queue replication
> --------------------------------------------------------
>
>                 Key: QPID-4082
>                 URL: https://issues.apache.org/jira/browse/QPID-4082
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.16
>            Reporter: Pavel Moravec
>            Assignee: Alan Conway
>            Priority: Minor
>              Labels: patch
>         Attachments: QPID-4082.patch
>
>
> Description of problem:
> With queue state replication set up between 2 clusters, restarting a broker 
> in both the source and destination clusters sometimes leads to cluster 
> de-sync. No QMF communication is involved, though the symptoms are similar to 
> the bug caused by missing propagation of QMF errors within a cluster.
> Version-Release number of selected component (if applicable):
> spotted in qpid 0.14, expected also in 0.16
> How reproducible:
> 100% within 10 minutes.
> Steps to Reproduce:
> 1. Have a 2-node src cluster and a 2-node dst cluster (see the reproducer for 
> an example config and for a script covering the further steps).
> 2. Have a queue state replication between the clusters.
> 3. Randomly stop or start a broker in a cluster (such that both clusters 
> always have at least 1 node running, i.e. stop+start only non-elder 
> brokers).
> 4. After each stop or start, send 1 message to the src.broker, to a queue 
> that is being replicated.
> 5. Wait some time
>   
> Actual results:
> The started-up broker in src.cluster may shut down after logging:
> 2012-05-31 11:58:40 critical cluster(10.34.1.218:26715 READY/error) local 
> error 502 did not occur on member 10.34.1.218:26294: invalid-argument: 
> anonymous.b941dd87-3fa1-442d-99f7-8c0907599b30: confirmed < (24+0) but only 
> sent < (23+0) (qpid/SessionState.cpp:154)
> Expected results:
> No such error
> Additional info:
> - the affected session is always the federation route for the queue state 
> replication
> - the stop and start of both one src and one dst broker is essential in the 
> scenario, e.g. without (re)starting a dst.broker, there is no error.
> - a scenario that is sometimes almost deterministic:
> 1) start everything, send a message
> 2) stop a dst.broker, send a message
> 3) stop a src.broker, send a message
> 4) start src.broker, then dst.broker
> 5) wait some time (e.g. 10 seconds) and send a message
> Sometimes I got the error instantly, sometimes never.
> Patch to be proposed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
