[
https://issues.apache.org/jira/browse/AMQ-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Petr Janata updated AMQ-6248:
-----------------------------
Attachment: AMQ-6248.patch.svndiff
Here is the patch. It is the output of the svn diff command.
> Failover - transport connected to one broker fails due to error in connection
> to another broker
> -----------------------------------------------------------------------------------------------
>
> Key: AMQ-6248
> URL: https://issues.apache.org/jira/browse/AMQ-6248
> Project: ActiveMQ
> Issue Type: Bug
> Components: Transport
> Reporter: Petr Janata
> Labels: failover, race-condition
> Attachments: AMQ-6248.patch.svndiff
>
>
> There is a bug in {{FailoverTransport}} that is triggered by a race
> condition. The client log contains the message:
> {{WARN | ActiveMQ Transport: *URI1* \[FailoverTransport] Transport (*URI2*)
> failed, attempting to automatically reconnect}}
> The exact impact on client failover differs with each setup and environment.
> In our case it forced the client to switch endlessly between the two
> available brokers.
> Assume the client is configured with a broker URL of the form
> {{failover:(URI1,URI2)?randomize=false}}.
> Assume that the broker at URI1 is down and the other broker, at URI2, is
> running fine. This is a normal master/slave setup.
> The client tries to establish a connection and the following happens:
> 1. URI1 is tried; it fails because this broker is not reachable (down or
> a waiting slave).
> 2. URI2 is tried; it succeeds because this broker is currently the 'master'.
> 3. An exception from the thread of the transport to URI1 causes a failure in
> the transport to URI2.
> 4. The next transport in the list is tried. Oh wait, it's URI1 -> go to 1.
> The impact for other configurations might not be that severe.
> Unfortunately, in our case we were not able to avoid this bug no matter the
> configuration. For example, {{randomize=true}} helped a little, but the
> infinite loop still happens half of the time.
> The bug is caused by a single shared {{TransportListener}} instance,
> {{myTransportListener}}, in the {{FailoverTransport}} class. {{doReconnect()}}
> tries to start the transport to URI1 and registers the listener on it. The
> transport fails to start and the next transport, to URI2, is tried. But the
> listener is not unregistered from the failed transport to URI1. Failures that
> happen on transport URI1 may call the listener method {{onException()}} in
> that transport's own thread. This call reaches {{handleTransportFailure()}},
> where it waits for the {{reconnectMutex}}. The reconnect task thread
> continues, establishes transport URI2, sets {{connectedTransport}}=URI2, and
> releases the {{reconnectMutex}}. The thread of transport URI1 then unblocks in
> {{handleTransportFailure()}} and destroys the {{connectedTransport}}=URI2.
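
The sequence above can be replayed in a small, single-threaded sketch. This is not the ActiveMQ code: `FakeTransport`, `Listener`, `SharedListenerFailover` and `replay()` are hypothetical stand-ins, and the thread interleaving is reduced to a fixed call order (URI1 fails to start, URI2 connects, URI1 reports a late exception). It only illustrates why a listener shared by all transports cannot tell which transport failed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a transport listener; not the ActiveMQ interface.
interface Listener {
    void onException(Exception error);
}

// Hypothetical stand-in for a transport to one broker URI.
class FakeTransport {
    final String uri;
    Listener listener;

    FakeTransport(String uri) {
        this.uri = uri;
    }

    // Simulates a failure reported later by this transport.
    void fail() {
        if (listener != null) {
            listener.onException(new java.io.IOException("connection to " + uri + " failed"));
        }
    }
}

class SharedListenerFailover {
    FakeTransport connectedTransport;
    final List<String> log = new ArrayList<>();

    // One listener instance shared by every transport, as in the description above.
    private final Listener sharedListener = error -> handleTransportFailure();

    void tryConnect(FakeTransport transport, boolean reachable) {
        transport.listener = sharedListener;
        if (reachable) {
            connectedTransport = transport;
            log.add("connected " + transport.uri);
        } else {
            // The listener stays registered on the failed transport.
            log.add("failed " + transport.uri);
        }
    }

    // Has no idea which transport reported the error, so it drops whatever is connected.
    void handleTransportFailure() {
        if (connectedTransport != null) {
            log.add("dropped " + connectedTransport.uri);
            connectedTransport = null;
        }
    }

    // Replays steps 1 to 3 in a fixed order.
    static SharedListenerFailover replay() {
        SharedListenerFailover failover = new SharedListenerFailover();
        FakeTransport uri1 = new FakeTransport("URI1");
        FakeTransport uri2 = new FakeTransport("URI2");
        failover.tryConnect(uri1, false); // step 1: URI1 is not reachable
        failover.tryConnect(uri2, true);  // step 2: URI2 connects
        uri1.fail();                      // step 3: late exception from URI1
        return failover;
    }

    public static void main(String[] args) {
        System.out.println(replay().log);
    }
}
```

After `replay()` the healthy connection to URI2 has been dropped even though only URI1 reported an error, which is the behaviour described in step 3.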
> I have created a patch against version 5.11 that deals specifically with this
> problem.
> The change is that, instead of the single shared {{myTransportListener}}
> instance, a new listener is created for each new transport.
> Each new listener keeps a reference to the transport it was assigned to. The
> listener will cause failover only if the exception comes from the
> transport that is currently connected.
> I did not deal with the other methods of the listener, but these probably
> need the same restriction.
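
The idea of the patch can be sketched as follows. This is not the attached patch and not the ActiveMQ API: `StubTransport`, `FailureListener`, `PerTransportFailover` and `replay()` are hypothetical names, and only the field names `reconnectMutex` and `connectedTransport` are borrowed from the description. Each transport gets its own listener that remembers its transport, and a failure is acted on only when it comes from the currently connected one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a transport listener; not the ActiveMQ interface.
interface FailureListener {
    void onException(Exception error);
}

// Hypothetical stand-in for a transport to one broker URI.
class StubTransport {
    final String uri;
    FailureListener listener;

    StubTransport(String uri) {
        this.uri = uri;
    }

    // Simulates a failure reported later by this transport.
    void fail() {
        if (listener != null) {
            listener.onException(new java.io.IOException("connection to " + uri + " failed"));
        }
    }
}

class PerTransportFailover {
    private final Object reconnectMutex = new Object();
    private StubTransport connectedTransport;
    final List<String> log = new ArrayList<>();

    void tryConnect(StubTransport transport, boolean reachable) {
        // A fresh listener per transport that remembers which transport it belongs to.
        transport.listener = error -> handleTransportFailure(transport);
        synchronized (reconnectMutex) {
            if (reachable) {
                connectedTransport = transport;
                log.add("connected " + transport.uri);
            } else {
                log.add("failed " + transport.uri);
            }
        }
    }

    void handleTransportFailure(StubTransport source) {
        synchronized (reconnectMutex) {
            // Ignore errors from any transport other than the connected one.
            if (source != connectedTransport) {
                log.add("ignored " + source.uri);
                return;
            }
            log.add("dropped " + source.uri);
            connectedTransport = null;
        }
    }

    String connectedUri() {
        synchronized (reconnectMutex) {
            return connectedTransport == null ? null : connectedTransport.uri;
        }
    }

    // Same call order as the failing scenario: URI1 fails, URI2 connects, URI1 errors late.
    static PerTransportFailover replay() {
        PerTransportFailover failover = new PerTransportFailover();
        StubTransport uri1 = new StubTransport("URI1");
        StubTransport uri2 = new StubTransport("URI2");
        failover.tryConnect(uri1, false);
        failover.tryConnect(uri2, true);
        uri1.fail();
        return failover;
    }

    public static void main(String[] args) {
        PerTransportFailover failover = replay();
        System.out.println(failover.log + " connected=" + failover.connectedUri());
    }
}
```

In this sketch the late exception from URI1 is ignored and the connection to URI2 survives, while a failure reported by URI2 itself would still trigger the drop.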
> This bug is present in all versions since version 4.0 (I did not check
> earlier ones).
> The idea in the patch should be applicable for all versions.
> Btw., the log message mentioned in AMQ-4986 shows the same URI1 vs. URI2
> problem.