John Anderson created AMQ-6092:
----------------------------------
Summary: Clear Broker to Broker Connection Info At Startup
Key: AMQ-6092
URL: https://issues.apache.org/jira/browse/AMQ-6092
Project: ActiveMQ
Issue Type: Bug
Components: activemq-leveldb-store
Affects Versions: 5.12.0
Environment: Linux
Reporter: John Anderson
Priority: Minor
This is a very difficult bug to describe, and an even tougher bug to replicate,
so I guess I'll start by describing the circumstances that triggered this bug.
At each of 3 data centers I have a replicated LevelDB ActiveMQ cluster. There
are store and forward connections between each data center. Phoenix has
non-duplexed connections to Amsterdam and Ashburn, and in turn each of those
sites has connections to the others. This makes a mesh-type topology. Within
a single datacenter, I have 3 copies of each broker using the replicated
LevelDB feature in a kind of active/passive/passive configuration.
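To make the topology concrete, here is a minimal sketch that enumerates the one-way store and forward links such a mesh implies, assuming every site connects to every other site as described above. The site labels are placeholders chosen for illustration, not the real broker names from the attached configurations.

```python
from itertools import permutations

# Placeholder labels for the three data centers (not the actual broker names).
SITES = ["phoenix", "amsterdam", "ashburn"]


def mesh_connectors(sites):
    """Return every one-way (source, destination) link in a full mesh."""
    return list(permutations(sites, 2))


if __name__ == "__main__":
    for source, destination in mesh_connectors(SITES):
        print(f"{source} -> {destination}")
```

With three sites and non-duplexed links, this comes to six one-way connections, each of which has to be re-established after a restart.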
This is just a PoC setup, sitting on VMware infrastructure, and it sat idle for
quite some time. At some point, while it was sitting idle, we had a storage
maintenance, which caused a storage disconnect in Ashburn and Amsterdam. A
storage disconnect is akin to just pulling the disk out of the box. Needless
to say, AMQ didn't like this one bit. However, surviving a storage disconnect
isn't really the point of the bug. The bug came into play when I tried
restarting the cluster after storage was restored.
I restarted each of the VMs, and began to bring the ActiveMQ instances back
online, starting ZooKeeper, then starting ActiveMQ. After bringing each
replicated LevelDB group back up, they refused to reconnect to each other via
the store & forward connections. I kept getting this error:
bq. Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due to
javax.jms.InvalidClientIDException: Broker: ams1-1 - Client:
ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from vm://ams1-1#0
| org.apache.activemq.broker.TransportConnection |
triggerStartAsyncNetworkBridgeCreation: remoteBroker=unconnected, localBroker=
vm://ams1-1#58408
Not a single broker would connect to another broker, and the messages imply
that these connections already existed. However, netstat showed that the
brokers were still attempting to establish connections, and this message
occurred over and over, as if they were retrying. Meanwhile, the web-based
admin console showed nothing under Network. Not a single real connection was
made.
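For anyone trying to confirm the same symptom, a small script along these lines could count how often each client ID is rejected in a broker log. It is only a sketch: the regular expression is derived from the single error message quoted above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern built from the one error message quoted in this report.
REJECTION = re.compile(
    r"InvalidClientIDException: Broker: (?P<broker>\S+) - Client: "
    r"(?P<client>\S+) already connected from (?P<origin>\S+)"
)


def count_rejections(log_text):
    """Count rejections per (broker, client ID, origin) in a log excerpt."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter(
        (m["broker"], m["client"], m["origin"]) for m in REJECTION.finditer(flat)
    )
```

A steadily climbing count for the same client ID, with nothing listed under Network in the admin console, would match the retry loop described here.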
After a lot of troubleshooting, especially looking into the LDAP
Authentication/Authorization settings and mechanism, I finally figured that it
had to be something persisted, because this exact same setup, without a single
configuration change, had been working perfectly before the storage disconnect.
In the end, I completely deleted the LevelDB directory and restarted ActiveMQ
on each node, and the setup is working flawlessly once again.
I haven't yet tried 5.13.0, and I'm pretty sure management isn't going to allow
me to cause a storage disconnect so I can test it, but I have a feeling that
some information about store & forward connections is stored in the persistent
store, and some sort of short write occurred when the storage disconnect
happened. However, since this data, whatever it may be, wasn't cleared or
reset at broker startup, the broker erroneously believed that the connections I
was trying to establish already existed.
This may be an incorrect assumption, but at startup, the broker should reset
any data it has that pertains to store and forward connections, because there's
no way anything can really be connected at that time.
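The behaviour being requested can be illustrated with a toy model. This is not ActiveMQ code, and whether ActiveMQ actually persists this kind of state is my unverified guess; the class and parameter names below are invented purely to show the difference between starting with stale connection records and clearing them first.

```python
class ToyBroker:
    """Toy model only: tracks client IDs believed to be connected."""

    def __init__(self, persisted_client_ids=()):
        # Stands in for whatever connection state might survive a restart.
        self.client_ids = set(persisted_client_ids)

    def start(self, reset_connection_state):
        # The proposal: nothing can be connected at startup, so clear it.
        if reset_connection_state:
            self.client_ids.clear()

    def add_connection(self, client_id):
        if client_id in self.client_ids:
            raise ValueError(f"{client_id} already connected")
        self.client_ids.add(client_id)
```

In this model, a broker started without the reset rejects the bridge's client ID forever, while one started with the reset accepts it on the first attempt.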
I'll attach my configurations so that the environment, if not the storage
disconnect, can be replicated.
The steps to reproduce, if they were practical, would be:
1.) Set up an AMQ store & forward mesh based on the attached configurations, on
VMware ESX infrastructure.
2.) Cause a storage interruption.
3.) Reboot the VMs running AMQ to reset the read-only state of the block
devices, after the storage interruption.
4.) Try to bring the cluster back online.