[jira] [Created] (AMQ-6092) Clear Broker to Broker Connection Info At Startup

John Anderson (JIRA) Wed, 16 Dec 2015 16:03:07 -0800

John Anderson created AMQ-6092:
----------------------------------

             Summary: Clear Broker to Broker Connection Info At Startup
                 Key: AMQ-6092
                 URL: https://issues.apache.org/jira/browse/AMQ-6092
             Project: ActiveMQ
          Issue Type: Bug
          Components: activemq-leveldb-store
    Affects Versions: 5.12.0
         Environment: Linux
            Reporter: John Anderson
            Priority: Minor



This is a very difficult bug to describe, and an even tougher bug to replicate, 
so I guess I'll start by describing the circumstances that triggered this bug.

At each of 3 data centers I have a replicated LevelDB ActiveMQ cluster.  There 
are store and forward connections between each data center. Phoenix has 
non-duplexed connections to Amsterdam and Ashburn, and in turn each of those 
sites has connections to the others.  This makes a mesh-type topology. Within 
a single datacenter, I have 3 copies of each broker using the replicated 
LevelDB feature in a kind of active/passive/passive configuration.
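
To make the topology concrete, here is a minimal sketch that enumerates the 
store and forward links, assuming "non-duplexed" means one connector per 
direction between every pair of sites.  The names SITES, BROKERS_PER_SITE and 
mesh_links are made up for illustration and do not come from the attached 
configurations.

```python
from itertools import permutations

# Site names as used in this report; 3 replicated LevelDB copies per site.
SITES = ["Phoenix", "Amsterdam", "Ashburn"]
BROKERS_PER_SITE = 3  # active/passive/passive


def mesh_links(sites):
    """Return every ordered (source, destination) pair of sites.

    Assumption: each non-duplex connection carries traffic one way only,
    so a full mesh needs one link per direction.
    """
    return list(permutations(sites, 2))


for source, destination in mesh_links(SITES):
    print(f"{source} -> {destination}")
print("broker instances:", len(SITES) * BROKERS_PER_SITE)
```

Under that assumption the mesh has 6 directed links and 9 broker instances, of 
which only one per site is active at any time.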

This is just a PoC setup, sitting on VMware infrastructure, and it sat idle for 
quite some time.  At some point, while it was sitting idle, we had a storage 
maintenance, which caused a storage disconnect in Ashburn and Amsterdam.  A 
storage disconnect is akin to just pulling the disk out of the box.  Needless 
to say, AMQ didn't like this one bit.  However, surviving a storage disconnect 
isn't really the point of the bug.  The bug came into play when I tried 
restarting the cluster after storage was restored.   

I restarted each of the VMs, and began to bring the ActiveMQ instances back 
online, starting ZooKeeper, then starting ActiveMQ.   After bringing each 
replicated LevelDB group back up, they refused to reconnect to each other via 
the store & forward connections.  I kept getting this error:


bq. Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due to 
javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: 
ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from vm://ams1-1#0 
| org.apache.activemq.broker.TransportConnection | 
triggerStartAsyncNetworkBridgeCreation: remoteBroker=unconnected, localBroker= 
vm://ams1-1#58408


Not a single broker would connect to another broker, and the messages imply 
that these connections already existed.  However, I could see from netstat that 
connection attempts were being made, and the message occurred over and over, as 
if the brokers were retrying.  Yet the web-based admin console showed nothing 
under Network.  Not a single real connection was made.
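
For anyone checking their own logs for the same symptom, here is a rough 
sketch that tallies how often each client ID is rejected.  The regular 
expression is inferred from the single error line quoted above, so the exact 
field layout is an assumption, and count_duplicate_client_ids is a 
hypothetical helper, not part of ActiveMQ.

```python
import re
from collections import Counter

# Layout inferred from the one sample line in this report; other ActiveMQ
# versions may word the message differently.
PATTERN = re.compile(
    r"InvalidClientIDException: Broker: (?P<broker>\S+) - Client: "
    r"(?P<client>\S+) already connected from (?P<origin>\S+)"
)

# The error from this report, wrapped the same way as in the email.
SAMPLE = """Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due to
javax.jms.InvalidClientIDException: Broker: ams1-1 - Client:
ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from vm://ams1-1#0"""


def count_duplicate_client_ids(log_text):
    """Count rejections per (broker, client ID, origin) in the given text."""
    # Collapse whitespace first so a message wrapped over several lines
    # still matches the pattern.
    flat = " ".join(log_text.split())
    return Counter(
        match.group("broker", "client", "origin")
        for match in PATTERN.finditer(flat)
    )


for key, hits in count_duplicate_client_ids(SAMPLE).items():
    print(hits, key)
```

A count that keeps growing for the same client ID while the admin console 
shows no network bridges would match what I saw here.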

After a lot of troubleshooting, especially looking into the LDAP 
Authentication/Authorization settings and mechanism, I finally figured that it 
had to be something persisted, because this exact same setup, without a single 
configuration change, had been working perfectly before the storage disconnect.

In the end, I completely deleted the LevelDB directory and restarted ActiveMQ 
on each node, and the setup is working flawlessly once again.

I haven't yet tried 5.13.0, and I'm pretty sure management isn't going to allow 
me to cause a storage disconnect so I can test it, but I have a feeling that 
some information about store & forward connections is stored in the persistent 
store, and some sort of short write occurred when the storage disconnect 
happened.  However, since this data, whatever it may be, wasn't cleared or 
reset at broker startup, the broker erroneously believed that the connections I 
was trying to establish already existed.

This may be an incorrect assumption, but at startup, the broker should reset 
any data it has that pertains to store and forward connections, because there's 
no way anything can really be connected at that time.

I'll attach my configurations so that the environment, if not the storage 
disconnect, can be replicated.

The steps to reproduce, if they were practical, would be:

1.) Set up an AMQ store & forward mesh based on the attached configurations, and 
on VMware ESX infrastructure.
2.) Cause a storage interruption.
3.) Reboot the VMs running AMQ to reset the read-only state of the block 
devices, after the storage interruption.
4.) Try to bring the cluster back online.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
