[
https://issues.apache.org/jira/browse/AMQ-6092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Anderson updated AMQ-6092:
-------------------------------
Attachment: activemq-configurations.tar.gz
Configuration files of the cluster:
conf/activemq.xml == Configuration for each & every node.
conf/login.config == JAAS configuration for each & every node.
conf/site.properties == properties specific to each node at one site.
conf/local.properties == properties specific to one single node.
> Clear Broker to Broker Connection Info At Startup
> -------------------------------------------------
>
> Key: AMQ-6092
> URL: https://issues.apache.org/jira/browse/AMQ-6092
> Project: ActiveMQ
> Issue Type: Bug
> Components: activemq-leveldb-store
> Affects Versions: 5.12.0
> Environment: Linux
> Reporter: John Anderson
> Priority: Minor
> Attachments: activemq-configurations.tar.gz
>
>
> This is a very difficult bug to describe, and an even tougher bug to
> replicate, so I guess I'll start by describing the circumstances that
> triggered this bug.
> At each of 3 data centers I have a replicated LevelDB ActiveMQ cluster. There
> are store and forward connections between the data centers. Phoenix has
> non-duplexed connections to Amsterdam and Ashburn, and in turn each of those
> sites has connections to the others. This makes a mesh type topology.
> Within a single datacenter, I have 3 copies of each broker using the
> replicated LevelDB feature in a kind of active/passive/passive configuration.
> This is just a PoC setup, sitting on VMware infrastructure, and it sat idle
> for quite some time. At some point, while it was sitting idle, we had
> storage maintenance, which caused a storage disconnect in Ashburn and
> Amsterdam. A storage disconnect is akin to just pulling the disk out of the
> box. Needless to say, AMQ didn't like this one bit. However, surviving a
> storage disconnect isn't really the point of the bug. The bug came into
> play when I tried restarting the cluster after storage was restored.
> I restarted each of the VMs, and began to bring the ActiveMQ instances back
> online, starting ZooKeeper, then starting ActiveMQ. After bringing each
> replicated LevelDB group back up, they refused to reconnect to each other via
> the store & forward connections. I kept getting this error:
> bq. Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due
> to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client:
> ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from
> vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection |
> triggerStartAsyncNetworkBridgeCreation: remoteBroker=unconnected,
> localBroker= vm://ams1-1#58408
> Not a single broker would connect to another broker, and the messages imply
> that these connections already existed. However, I could see from netstat
> that connection attempts were being made, and this message occurred over and
> over, as if the brokers were retrying. Meanwhile, the web-based admin
> console showed nothing under Network. Not a single real connection was
> made.
> After a lot of troubleshooting, especially looking into the LDAP
> Authentication/Authorization settings and mechanism, I finally figured that
> it had to be something persisted, because this exact same setup, without a
> single configuration change, had been working perfectly before the storage
> disconnect.
> In the end, I completely deleted the LevelDB directory and restarted
> ActiveMQ on each node, and the setup is working flawlessly once again.
> I haven't yet tried 5.13.0, and I'm pretty sure management isn't going to
> allow me to cause a storage disconnect so I can test it, but I have a feeling
> that some information about store & forward connections is stored in the
> persistent store, and some sort of short write occurred when the storage
> disconnect happened. However, since this data, whatever it may be, wasn't
> cleared or reset at broker startup, the broker erroneously believed that the
> connections I was trying to establish already existed.
> This may be an incorrect assumption, but at startup, the broker should reset
> any data it has that pertains to store and forward connections, because
> there's no way anything can really be connected at that time.
> I'll attach my configurations so that the environment, if not the storage
> disconnect, can be replicated.
> The steps to reproduce, if they were practical, would be:
> 1.) Set up an AMQ store & forward mesh based on the attached configurations,
> on VMware ESX infrastructure.
> 2.) Cause a storage interruption.
> 3.) Reboot the VMs running AMQ to reset the read-only state of the block
> devices, after the storage interruption.
> 4.) Try to bring the cluster back online.
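
For anyone trying to confirm the same symptom on their own cluster, here is a minimal sketch that counts how often each client ID is rejected as "already connected". It assumes the broker log contains lines shaped like the single error quoted above; the regular expression and the function name `rejected_clients` are illustrative, not part of ActiveMQ, and other versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the one error line quoted in this issue (assumption:
# other ActiveMQ versions may format the message differently).
PATTERN = re.compile(
    r"InvalidClientIDException:\s+Broker:\s+(?P<broker>\S+)\s+-\s+Client:\s+"
    r"(?P<client>\S+)\s+already\s+connected\s+from\s+(?P<origin>\S+)"
)


def rejected_clients(log_text):
    """Count (broker, client ID, origin) triples rejected as already connected."""
    # Log lines may be wrapped, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter(
        m.group("broker", "client", "origin") for m in PATTERN.finditer(flat)
    )


# The error text from this issue, used as sample input.
sample = (
    "Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due "
    "to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: "
    "ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from "
    "vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection"
)

for (broker, client, origin), count in rejected_clients(sample).items():
    print(broker, client, origin, count)
```

A count that keeps growing for the same client ID while the admin console shows nothing under Network would match the retry loop described in the issue.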
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)