[
https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979045#action_12979045
]
Mark Moseley commented on QPID-2982:
------------------------------------
Commenting per Alan's email. Same setup as QPID-2992. I was getting this with
Squeeze's corosync (1.2.1) and openais (1.1.2); I upgraded them today (to 1.3.0
and 1.1.4, respectively) and still get this error.
As in 2992, there are two clusters, A (with nodes A1 and A2) and B (with B1 and
B2).
I'm doing the same setup as in 2992:
On all boxes in cluster A and B:
* Kill the qpidd if it's running and delete all existing store files, i.e.
contents of /var/lib/qpid/
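The clean-slate step might be scripted like the sketch below. The init-script
path is an assumption (I don't know how qpidd is launched on these boxes), and
DRY_RUN defaults to echo so it only prints what it would do:

```shell
#!/bin/sh
# Dry-run sketch of the reset step for every node in clusters A and B.
# /etc/init.d/qpidd is an assumption; leave DRY_RUN as "echo" to preview,
# set DRY_RUN= (empty) to actually run it.
DRY_RUN=${DRY_RUN:-echo}

reset_node() {
  $DRY_RUN /etc/init.d/qpidd stop      # kill qpidd if it's running
  $DRY_RUN rm -rf /var/lib/qpid/*      # delete all existing store files
}

reset_node
```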
On host A1 in cluster A (I'm leaving out the -a user/t...@host stuff):
* Start up qpid
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue walmyq1 --durable
* qpid-config bind walmyex1 walmyq1 unix.waltham.cust
On host B1 in cluster B:
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue bosmyq1 --durable
* qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
On cluster A:
* Start other member of cluster, A2
* qpid-route route add amqps://user/p...@hosta1:5671 amqps://user/p...@hostb1:5671 walmyex1 unix.waltham.cust -d
On cluster B:
* Start other member of cluster, B2
* qpid-route route add amqps://user/p...@hostb1:5671 amqps://user/p...@hosta1:5671 bosmyex1 unix.boston.cust -d
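For reference, the setup bullets above collapse into one sketch. The elided
credentials (user/p...@...) are kept as placeholders exactly as written, and
DRY_RUN=echo previews the commands instead of running them:

```shell
#!/bin/sh
# One-shot sketch of the setup above; host names and credentials are
# placeholders, DRY_RUN=echo just prints each command.
DRY_RUN=${DRY_RUN:-echo}
QC="$DRY_RUN qpid-config"

setup_a1() {     # on host A1 in cluster A
  $QC add exchange direct bosmyex1 --durable
  $QC add exchange direct walmyex1 --durable
  $QC add queue walmyq1 --durable
  $QC bind walmyex1 walmyq1 unix.waltham.cust
}

setup_b1() {     # on host B1 in cluster B
  $QC add exchange direct bosmyex1 --durable
  $QC add exchange direct walmyex1 --durable
  $QC add queue bosmyq1 --durable
  $QC bind bosmyex1 bosmyq1 unix.boston.cust
}

add_routes() {   # after A2 and B2 have joined their clusters
  $DRY_RUN qpid-route route add amqps://user/p...@hosta1:5671 \
      amqps://user/p...@hostb1:5671 walmyex1 unix.waltham.cust -d
  $DRY_RUN qpid-route route add amqps://user/p...@hostb1:5671 \
      amqps://user/p...@hosta1:5671 bosmyex1 unix.boston.cust -d
}

setup_a1; setup_b1; add_routes
```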
On either cluster:
* Check "qpid-config exchanges --bindings" to make sure bindings are correct
for remote exchanges
* To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start
cluster back up, check bindings.
* To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start
cluster back up, check bindings.
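The two stop-order tests can be sketched as below for cluster B (same idea for
A). The ssh wrapper, init-script path, and host names are assumptions;
DRY_RUN=echo previews the commands:

```shell
#!/bin/sh
# Sketch of the correct vs. broken stop orders described above.
DRY_RUN=${DRY_RUN:-echo}

stop_order() {                  # stop nodes in exactly the order given
  for node in "$@"; do
    $DRY_RUN ssh "$node" /etc/init.d/qpidd stop
  done
}

start_cluster() {
  for node in "$@"; do
    $DRY_RUN ssh "$node" /etc/init.d/qpidd start
  done
}

check_bindings() {              # verify bindings for the remote exchanges
  $DRY_RUN qpid-config exchanges --bindings
}

# Correct behaviour: stop B1 first, then B2, restart, check bindings.
stop_order hostB1 hostB2; start_cluster hostB1 hostB2; check_bindings
# Broken behaviour: stop B2 first, then B1, restart, check bindings.
stop_order hostB2 hostB1; start_cluster hostB1 hostB2; check_bindings
```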
If on A1 I run qpid-tool and do something like 'list' enough times that it
returns data (instead of the first few times, where it returns nothing), I very
often get the following on node A2. This is the log on A2 all the way from
startup, which is only 10-15 seconds earlier:
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/watchdog.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/acl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/xml.so
2011-01-07 19:35:30 info Loaded Module:
/usr/lib/qpid/daemon/replicating_listener.so
2011-01-07 19:35:30 info Loaded Module:
/usr/lib/qpid/daemon/replication_exchange.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/ssl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/cluster.so
2011-01-07 19:35:30 info Management enabled
2011-01-07 19:35:30 info Registered xml exchange
2011-01-07 19:35:30 info Registered replication exchange
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 notice Store module initialized; store-dir=/var/lib/qpid
2011-01-07 19:35:31 info > Default files per journal: 8
2011-01-07 19:35:31 info > Default journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > Default write cache page size: 32 (Kib)
2011-01-07 19:35:31 info > Default number of write cache pages: 32
2011-01-07 19:35:31 info > TPL files per journal: 8
2011-01-07 19:35:31 info > TPL journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > TPL write cache page size: 4 (Kib)
2011-01-07 19:35:31 info > TPL number of write cache pages: 64
2011-01-07 19:35:31 notice Initializing CPG
2011-01-07 19:35:31 notice Cluster store state: dirty
cluster-id=54ef8a97-b8e8-47a2-935d-e619e4d7c895
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) configuration
change: 101.0.0.0:10552 102.0.0.0:13904
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) Members joined:
102.0.0.0:13904
2011-01-07 19:35:31 notice Cluster recovery: recovered journal data discarded
and journal files pushed down
2011-01-07 19:35:31 notice Store directory /var/lib/qpid/rhm was pushed down
(saved) into directory /var/lib/qpid/_cluster.bak.0002.
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 info SASL enabled
2011-01-07 19:35:31 notice Listening on TCP port 5672
2011-01-07 19:35:31 notice Read ACL file "/etc/qpid/acl.conf"
2011-01-07 19:35:31 notice Listening for SSL connections on TCP port 5671
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 INIT) not active for links.
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 JOINER) joining cluster
walclust
2011-01-07 19:35:31 notice Broker running
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 JOINER) stall for update offer
from 101.0.0.0:10552 to 102.0.0.0:13904
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 UPDATEE) receiving update
from 101.0.0.0:10552
2011-01-07 19:35:31 notice Journal "walmyq1": Created
2011-01-07 19:35:31 info Queue "walmyq1": Policy created: type=reject;
maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qpid.cluster-update": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection
101.0.0.0:10552-6
2011-01-07 19:35:31 info Queue
"bridge_queue_1_d0638d51-8b76-48d0-a232-10e8bbed772d": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection
101.0.0.0:10552-7
2011-01-07 19:35:31 info Queue "reply-HOSTA1.10654.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "topic-HOSTA1.10654.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-HOSTA1.10654.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-ui-HOSTA1.10654.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-hb-HOSTA1.10654.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection
101.0.0.0:10552-9
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update:
101.0.0.0:10552(member) 102.0.0.0:13904(unknown) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 CATCHUP) update complete,
starting catch-up.
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update:
101.0.0.0:10552(member) 102.0.0.0:13904(member) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 READY) caught up.
2011-01-07 19:35:32 info Passivating links
2011-01-07 19:35:39 info Queue "reply-HOSTA1.10670.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "topic-HOSTA1.10670.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-HOSTA1.10670.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-ui-HOSTA1.10670.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-hb-HOSTA1.10670.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:51 error Execution exception: invalid-argument:
[email protected]: confirmed < (109+0) but only sent < (108+0)
(qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical cluster(102.0.0.0:13904 READY/error) local error
2770 did not occur on member 101.0.0.0:10552: invalid-argument:
[email protected]: confirmed < (109+0) but only sent < (108+0)
(qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical Error delivering frames: local error did not occur
on all cluster members : invalid-argument: [email protected]: confirmed
< (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
(qpid/cluster/ErrorCheck.cpp:89)
2011-01-07 19:35:51 notice cluster(102.0.0.0:13904 LEFT/error) leaving cluster
walclust
2011-01-07 19:35:51 notice Shut down
In the logs above:
* 'mark' is the qpid SASL username I'm using. It's definitely valid on both
nodes, as I can use it with qpid-config just fine on A1 and A2. I also use it
successfully with qpid-tool.
* Those "Policy created" messages show up when I initially did 'list' in
qpid-tool on A1.
* When 'list' returned data on A1, A2 shut down.
I don't get the above error 100% of the time. Sometimes it works consistently;
other times it doesn't, and keeps failing for a while (A2 shuts back down
almost immediately after rejoining the cluster). The only way I can
consistently get it to crash is to replicate the broken scenario in QPID-2992:
after failing B2->B1, A2 crashes as soon as qpid-tool starts returning data
(but not before), even if I restart A1 and A2 repeatedly.
These are test clusters, so there's nothing else going on on them, and I can
perform whatever testing would be useful to the Qpid team.
> Inconsistent management messages in a cluster
> ---------------------------------------------
>
> Key: QPID-2982
> URL: https://issues.apache.org/jira/browse/QPID-2982
> Project: Qpid
> Issue Type: Bug
> Components: C++ Clustering
> Affects Versions: 0.8
> Reporter: Alan Conway
> Assignee: Alan Conway
> Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which
> can cause a cluster member to fail with an invalid-arg error.
> The cluster_tests.py has a test_management test which exercises management
> code and collects detailed trace logging. The logs are compared for
> consistency.
> Currently this test fails. All inconsistencies revealed by the test need to
> be corrected.