[
https://issues.apache.org/jira/browse/QPID-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979045#action_12979045
]
Mark Moseley commented on QPID-2982:
------------------------------------
Commenting per Alan's email. Same setup as QPID-2992. I was getting this with
Squeeze's corosync (1.2.1) and openais (1.1.2); I upgraded them today (to 1.3.0
and 1.1.4, respectively) and still get this error.
As in 2992, there are two clusters, A (with nodes A1 and A2) and B (with B1 and
B2).
I'm doing the same setup as in 2992:
On all boxes in cluster A and B:
* Kill the qpidd if it's running and delete all existing store files, i.e.
contents of /var/lib/qpid/
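The clean-slate step might be scripted like the sketch below. The init-script
path is an assumption (I don't know how qpidd is launched on these boxes), and
DRY_RUN defaults to echo so it only prints what it would do:

```shell
#!/bin/sh
# Dry-run sketch of the reset step for every node in clusters A and B.
# /etc/init.d/qpidd is an assumption; leave DRY_RUN as "echo" to preview,
# set DRY_RUN= (empty) to actually run it.
DRY_RUN=${DRY_RUN:-echo}

reset_node() {
  $DRY_RUN /etc/init.d/qpidd stop      # kill qpidd if it's running
  $DRY_RUN rm -rf /var/lib/qpid/*      # delete all existing store files
}

reset_node
```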
On host A1 in cluster A (I'm leaving out the -a user/t...@host stuff):
* Start up qpid
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue walmyq1 --durable
* qpid-config bind walmyex1 walmyq1 unix.waltham.cust
On host B1 in cluster B:
* qpid-config add exchange direct bosmyex1 --durable
* qpid-config add exchange direct walmyex1 --durable
* qpid-config add queue bosmyq1 --durable
* qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
On cluster A:
* Start other member of cluster, A2
* qpid-route route add amqps://user/p...@hosta1:5671 amqps://user/p...@hostb1:5671 walmyex1 unix.waltham.cust -d
On cluster B:
* Start other member of cluster, B2
* qpid-route route add amqps://user/p...@hostb1:5671 amqps://user/p...@hosta1:5671 bosmyex1 unix.boston.cust -d
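For reference, the setup bullets above collapse into one sketch. The elided
credentials (user/p...@...) are kept as placeholders exactly as written, and
DRY_RUN=echo previews the commands instead of running them:

```shell
#!/bin/sh
# One-shot sketch of the setup above; host names and credentials are
# placeholders, DRY_RUN=echo just prints each command.
DRY_RUN=${DRY_RUN:-echo}
QC="$DRY_RUN qpid-config"

setup_a1() {     # on host A1 in cluster A
  $QC add exchange direct bosmyex1 --durable
  $QC add exchange direct walmyex1 --durable
  $QC add queue walmyq1 --durable
  $QC bind walmyex1 walmyq1 unix.waltham.cust
}

setup_b1() {     # on host B1 in cluster B
  $QC add exchange direct bosmyex1 --durable
  $QC add exchange direct walmyex1 --durable
  $QC add queue bosmyq1 --durable
  $QC bind bosmyex1 bosmyq1 unix.boston.cust
}

add_routes() {   # after A2 and B2 have joined their clusters
  $DRY_RUN qpid-route route add amqps://user/p...@hosta1:5671 \
      amqps://user/p...@hostb1:5671 walmyex1 unix.waltham.cust -d
  $DRY_RUN qpid-route route add amqps://user/p...@hostb1:5671 \
      amqps://user/p...@hosta1:5671 bosmyex1 unix.boston.cust -d
}

setup_a1; setup_b1; add_routes
```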
On either cluster:
* Check "qpid-config exchanges --bindings" to make sure bindings are correct
for remote exchanges
* To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start
cluster back up, check bindings.
* To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start
cluster back up, check bindings.
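The two stop-order tests can be sketched as below for cluster B (same idea for
A). The ssh wrapper, init-script path, and host names are assumptions;
DRY_RUN=echo previews the commands:

```shell
#!/bin/sh
# Sketch of the correct vs. broken stop orders described above.
DRY_RUN=${DRY_RUN:-echo}

stop_order() {                  # stop nodes in exactly the order given
  for node in "$@"; do
    $DRY_RUN ssh "$node" /etc/init.d/qpidd stop
  done
}

start_cluster() {
  for node in "$@"; do
    $DRY_RUN ssh "$node" /etc/init.d/qpidd start
  done
}

check_bindings() {              # verify bindings for the remote exchanges
  $DRY_RUN qpid-config exchanges --bindings
}

# Correct behaviour: stop B1 first, then B2, restart, check bindings.
stop_order hostB1 hostB2; start_cluster hostB1 hostB2; check_bindings
# Broken behaviour: stop B2 first, then B1, restart, check bindings.
stop_order hostB2 hostB1; start_cluster hostB1 hostB2; check_bindings
```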
If on A1 I run qpid-tool and do something like 'list' enough times that it
returns data (instead of the first few times, where it returns nothing), I very
often get the following on node A2. This is the log on A2 all the way from
startup, which is only 10-15 seconds earlier:
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/watchdog.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/acl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/xml.so
2011-01-07 19:35:30 info Loaded Module:
/usr/lib/qpid/daemon/replicating_listener.so
2011-01-07 19:35:30 info Loaded Module:
/usr/lib/qpid/daemon/replication_exchange.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/ssl.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so
2011-01-07 19:35:30 info Loaded Module: /usr/lib/qpid/daemon/cluster.so
2011-01-07 19:35:30 info Management enabled
2011-01-07 19:35:30 info Registered xml exchange
2011-01-07 19:35:30 info Registered replication exchange
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 notice Store module initialized; store-dir=/var/lib/qpid
2011-01-07 19:35:31 info > Default files per journal: 8
2011-01-07 19:35:31 info > Default journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > Default write cache page size: 32 (Kib)
2011-01-07 19:35:31 info > Default number of write cache pages: 32
2011-01-07 19:35:31 info > TPL files per journal: 8
2011-01-07 19:35:31 info > TPL journal file size: 24 (wpgs)
2011-01-07 19:35:31 info > TPL write cache page size: 4 (Kib)
2011-01-07 19:35:31 info > TPL number of write cache pages: 64
2011-01-07 19:35:31 notice Initializing CPG
2011-01-07 19:35:31 notice Cluster store state: dirty
cluster-id=54ef8a97-b8e8-47a2-935d-e619e4d7c895
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) configuration
change: 101.0.0.0:10552 102.0.0.0:13904
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 PRE_INIT) Members joined:
102.0.0.0:13904
2011-01-07 19:35:31 notice Cluster recovery: recovered journal data discarded
and journal files pushed down
2011-01-07 19:35:31 notice Store directory /var/lib/qpid/rhm was pushed down
(saved) into directory /var/lib/qpid/_cluster.bak.0002.
2011-01-07 19:35:31 notice Journal "TplStore": Created
2011-01-07 19:35:31 info SASL enabled
2011-01-07 19:35:31 notice Listening on TCP port 5672
2011-01-07 19:35:31 notice Read ACL file "/etc/qpid/acl.conf"
2011-01-07 19:35:31 notice Listening for SSL connections on TCP port 5671
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 INIT) not active for links.
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 JOINER) joining cluster
walclust
2011-01-07 19:35:31 notice Broker running
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 JOINER) stall for update offer
from 101.0.0.0:10552 to 102.0.0.0:13904
2011-01-07 19:35:31 notice cluster(102.0.0.0:13904 UPDATEE) receiving update
from 101.0.0.0:10552
2011-01-07 19:35:31 notice Journal "walmyq1": Created
2011-01-07 19:35:31 info Queue "walmyq1": Policy created: type=reject;
maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qpid.cluster-update": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection
101.0.0.0:10552-6
2011-01-07 19:35:31 info Queue
"bridge_queue_1_d0638d51-8b76-48d0-a232-10e8bbed772d": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection
101.0.0.0:10552-7
2011-01-07 19:35:31 info Queue "reply-HOSTA1.10654.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "topic-HOSTA1.10654.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-HOSTA1.10654.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-ui-HOSTA1.10654.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:31 info Queue "qmfc-v2-hb-HOSTA1.10654.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 UPDATEE) new shadow connection
101.0.0.0:10552-9
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update:
101.0.0.0:10552(member) 102.0.0.0:13904(unknown) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 CATCHUP) update complete,
starting catch-up.
2011-01-07 19:35:32 info cluster(102.0.0.0:13904 CATCHUP) member update:
101.0.0.0:10552(member) 102.0.0.0:13904(member) frameSeq=2427
2011-01-07 19:35:32 notice cluster(102.0.0.0:13904 READY) caught up.
2011-01-07 19:35:32 info Passivating links
2011-01-07 19:35:39 info Queue "reply-HOSTA1.10670.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "topic-HOSTA1.10670.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-HOSTA1.10670.1": Policy created:
type=reject; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-ui-HOSTA1.10670.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:40 info Queue "qmfc-v2-hb-HOSTA1.10670.1": Policy created:
type=ring; maxCount=0; maxSize=104857600
2011-01-07 19:35:51 error Execution exception: invalid-argument:
[email protected]: confirmed < (109+0) but only sent < (108+0)
(qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical cluster(102.0.0.0:13904 READY/error) local error
2770 did not occur on member 101.0.0.0:10552: invalid-argument:
[email protected]: confirmed < (109+0) but only sent < (108+0)
(qpid/SessionState.cpp:154)
2011-01-07 19:35:51 critical Error delivering frames: local error did not occur
on all cluster members : invalid-argument: [email protected]: confirmed
< (109+0) but only sent < (108+0) (qpid/SessionState.cpp:154)
(qpid/cluster/ErrorCheck.cpp:89)
2011-01-07 19:35:51 notice cluster(102.0.0.0:13904 LEFT/error) leaving cluster
walclust
2011-01-07 19:35:51 notice Shut down
In the logs above:
* 'mark' is the qpid SASL username I'm using. It's definitely valid on both
nodes, as I can use it with qpid-config just fine on A1 and A2. I also use it
successfully with qpid-tool.
* Those "Policy created" messages show up when I initially did 'list' in
qpid-tool on A1.
* When 'list' returned data on A1, A2 shut down.
I don't get the above error 100% of the time. Sometimes it works consistently;
other times it doesn't, and keeps failing for a while (A2 shuts back down
almost immediately after rejoining the cluster). The only way I can
consistently get it to crash is to replicate the broken scenario in QPID-2992:
after failing B2->B1, A2 crashes as soon as qpid-tool starts returning data
(but not before), even if I restart A1 and A2 repeatedly.
These are test clusters, so there's nothing else going on on them, and I can
perform whatever testing would be useful to the Qpid team.
> Inconsistent management messages in a cluster
> ---------------------------------------------
>
> Key: QPID-2982
> URL: https://issues.apache.org/jira/browse/QPID-2982
> Project: Qpid
> Issue Type: Bug
> Components: C++ Clustering
> Affects Versions: 0.8
> Reporter: Alan Conway
> Assignee: Alan Conway
> Fix For: 0.9
>
>
> Management messages on different members of a cluster can be different, which
> can cause a cluster member to fail with an invalid-arg error.
> The cluster_tests.py has a test_management test which exercises management
> code and collects detailed trace logging. The logs are compared for
> consistency.
> Currently this test fails. All inconsistencies revealed by the test need to
> be corrected.