[ 
https://issues.apache.org/jira/browse/AMQ-6092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Anderson updated AMQ-6092:
-------------------------------
    Attachment: activemq-configurations.tar.gz

Configuration files of the cluster:

conf/activemq.xml  == Configuration for each & every node.
conf/login.config == JAAS configuration for each & every node.
conf/site.properties == properties specific to each node at one site.
conf/local.properties == properties specific to one single node.

> Clear Broker to Broker Connection Info At Startup
> -------------------------------------------------
>
>                 Key: AMQ-6092
>                 URL: https://issues.apache.org/jira/browse/AMQ-6092
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: activemq-leveldb-store
>    Affects Versions: 5.12.0
>         Environment: Linux
>            Reporter: John Anderson
>            Priority: Minor
>         Attachments: activemq-configurations.tar.gz
>
>
> This is a very difficult bug to describe, and an even tougher bug to 
> replicate, so I guess I'll start by describing the circumstances that 
> triggered this bug.
> At each of 3 data centers I have a replicated LevelDB ActiveMQ cluster.  There
> are store and forward connections between each data center. Phoenix has
> non-duplexed connections to Amsterdam and Ashburn, and in turn each of those
> sites has connections to the others.  This makes a mesh-type topology.
> Within a single datacenter, I have 3 copies of each broker using the
> replicated LevelDB feature in a kind of active/passive/passive configuration.
> This is just a PoC setup, sitting on VMware infrastructure, and it sat idle 
> for quite some time.  At some point, while it was sitting idle, we had a 
> storage maintenance, which caused a storage disconnect in Ashburn and 
> Amsterdam.  A storage disconnect is akin to just pulling the disk out of the 
> box.  Needless to say, AMQ didn't like this one bit.  However, surviving a 
> storage disconnect isn't really the point of the bug.  The bug came into
> play when I tried restarting the cluster after storage was restored.
> I restarted each of the VMs, and began to bring the ActiveMQ instances back
> online, starting ZooKeeper, then starting ActiveMQ.  After I brought each
> replicated LevelDB group back up, the groups refused to reconnect to each
> other via the store & forward connections.  I kept getting this error:
> bq. Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due 
> to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: 
> ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from 
> vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection | 
> triggerStartAsyncNetworkBridgeCreation: remoteBroker=unconnected, 
> localBroker= vm://ams1-1#58408
> Not a single broker would connect to another broker, and the messages imply
> that these connections already existed.  However, netstat showed that
> connection attempts were being made, and this message occurred over and
> over, as if the brokers were retrying.  Meanwhile, the web-based admin
> console showed nothing under Network.  Not a single real connection was made.
> After a lot of troubleshooting, especially looking into the LDAP 
> Authentication/Authorization settings and mechanism, I finally figured that 
> it had to be something persisted, because this exact same setup, without a 
> single configuration change, had been working perfectly before the storage 
> disconnect.
> In the end, I deleted the LevelDB directory completely and restarted
> ActiveMQ on each node, and the setup is working flawlessly once again.
> I haven't yet tried 5.13.0, and I'm pretty sure management isn't going to 
> allow me to cause a storage disconnect so I can test it, but I have a feeling 
> that some information about store & forward connections is stored in the 
> persistent store, and some sort of short write occurred when the storage
> disconnect happened.  However, since this data, whatever it may be, wasn't 
> cleared or reset at broker startup, the broker erroneously believed that the 
> connections I was trying to establish already existed.
> This may be an incorrect assumption, but at startup, the broker should reset 
> any data it has that pertains to store and forward connections, because 
> there's no way anything can really be connected at that time.
> I'll attach my configurations so that the environment, if not the storage 
> disconnect, can be replicated.
> The steps to reproduce, if they were practical, would be:
> 1.) Set up an AMQ store & forward mesh based on the attached configurations,
> and on VMware ESX infrastructure.
> 2.) Cause a storage interruption.
> 3.) Reboot the VMs running AMQ to reset the read-only state of the block 
> devices, after the storage interruption.
> 4.) Try to bring the cluster back online.
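
For anyone trying to confirm the same symptom on their own brokers, the rough sketch below counts how often the rejection quoted in the description repeats in a broker log. It assumes the log text matches the wording of the quoted error exactly; the names `REJECTION` and `count_rejections` are made up for illustration and are not part of ActiveMQ.

```python
import re
from collections import Counter

# Matches the rejection text quoted in the issue description.
# Assumption: real log entries use exactly this wording.
REJECTION = re.compile(
    r"InvalidClientIDException: Broker: (?P<broker>\S+) - Client: "
    r"(?P<client_id>\S+) already connected from (?P<origin>\S+)"
)


def count_rejections(log_text):
    """Count rejections per (broker, client ID, origin) in a block of log text."""
    # Collapse all whitespace so entries that were wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter(
        (m["broker"], m["client_id"], m["origin"])
        for m in REJECTION.finditer(flat)
    )


if __name__ == "__main__":
    sample = (
        "Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due "
        "to javax.jms.InvalidClientIDException: Broker: ams1-1 - Client: "
        "ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from "
        "vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection\n"
    )
    for key, count in count_rejections(sample * 3).items():
        print(count, key)
```

A count that keeps growing for the same client ID while the admin console lists no network connections would be consistent with the retry loop described above, though it does not by itself show where the stale registration lives.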



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
