[ 
https://issues.apache.org/jira/browse/AMQ-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petr Janata updated AMQ-6248:
-----------------------------
    Attachment: AMQ-6248.patch.svndiff

Here is the patch. It is the output of the svn diff command.

> Failover - transport connected to one broker fails due to error in connection 
> to another broker
> -----------------------------------------------------------------------------------------------
>
>                 Key: AMQ-6248
>                 URL: https://issues.apache.org/jira/browse/AMQ-6248
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Transport
>            Reporter: Petr Janata
>              Labels: failover, race-condition
>         Attachments: AMQ-6248.patch.svndiff
>
>
> There is a bug in {{FailoverTransport}} which is triggered by a race 
> condition. The client log contains the message:
> {{WARN | ActiveMQ Transport: *URI1* \[FailoverTransport] Transport (*URI2*) 
> failed, attempting to automatically reconnect}}
> The exact impact on client failover differs with each setup and environment. 
> In our case it forced the client to switch endlessly between the two 
> available brokers.
> Assume the client is configured to use a broker URL of the form
> {{failover:(URI1,URI2)?randomize=false}}.
> Assume that the broker at URI1 is down and the other broker at URI2 is 
> running fine. This is a normal master/slave setup. 
> The client tries to establish a connection and the following happens:
> 1. URI1 is tried; it fails because this broker is not reachable (down or 
> a waiting slave).
> 2. URI2 is tried; it succeeds because this broker is currently the 'master'.
> 3. An exception from the thread of the transport to URI1 causes a failure in 
> the transport to URI2.
> 4. Another transport in the list is tried. Oh wait, it's URI1 -> go to 1.
> The impact for other configurations might not be that severe. But 
> unfortunately in our case we were not able to avoid this bug no matter the 
> configuration. For example {{randomize=true}} helped a little, but the 
> infinite loop still happens half of the time.
> The bug is caused by a single shared instance {{myTransportListener}} of 
> {{TransportListener}} in the {{FailoverTransport}} class. {{doReconnect()}} 
> tries to start the transport to URI1 and registers the listener on it. The 
> transport fails to start and the next transport, to URI2, is tried. But the 
> listener is not unregistered from the failed transport to URI1. Failures 
> that happen on transport URI1 may call the listener method 
> {{onException()}} in that transport's own thread. This call reaches 
> {{handleTransportFailure()}}, where it waits for the {{reconnectMutex}}. 
> Meanwhile the reconnect task thread continues, establishes transport URI2, 
> sets {{connectedTransport}}=URI2, and releases the {{reconnectMutex}}. The 
> thread of transport URI1 then unblocks in {{handleTransportFailure()}} and 
> destroys the {{connectedTransport}}=URI2.
> I have created a patch against version 5.11 that deals specifically with this 
> problem.
> The change is that instead of the single shared {{myTransportListener}} 
> instance, a new listener is created for each new transport.
> Each new listener keeps a reference to the transport it was assigned to. The 
> listener will cause failover only if the exception comes from the 
> transport which is currently connected.
> I didn't touch the other methods of the listener, but these probably 
> need the same restriction.
> This bug is present in all versions since version 4.0 (I didn't check 
> earlier ones). The idea in the patch should be applicable to all versions.
> Btw, the log message mentioned in AMQ-4986 shows the same URI1 vs URI2 
> problem.

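To illustrate the idea described above, here is a minimal sketch in Java. It is not the attached patch and not the real FailoverTransport code: FakeTransport, FailoverModel, sharedListener() and listenerFor() are hypothetical names made up for this example, and only the reconnectMutex / connectedTransport names are borrowed from the description. The threads are left out; the calls are simply made in the order the race produces.

```java
import java.io.IOException;

/** Hypothetical stand-in for a transport; not the real ActiveMQ interface. */
final class FakeTransport {
    final String uri;
    FakeTransport(String uri) { this.uri = uri; }
}

/** Simplified model (an assumption) of the failover state from the report. */
final class FailoverModel {
    /** Listener reduced to the single callback that matters here. */
    interface Listener { void onException(IOException error); }

    private final Object reconnectMutex = new Object();
    private FakeTransport connectedTransport;

    void setConnected(FakeTransport transport) {
        synchronized (reconnectMutex) { connectedTransport = transport; }
    }

    FakeTransport getConnected() {
        synchronized (reconnectMutex) { return connectedTransport; }
    }

    /** Behaviour as reported: one shared listener that cannot tell who failed. */
    Listener sharedListener() {
        return error -> {
            synchronized (reconnectMutex) {
                connectedTransport = null; // drops whatever is connected right now
            }
        };
    }

    /** Idea of the fix: the listener remembers the transport it belongs to. */
    Listener listenerFor(FakeTransport owner) {
        return error -> {
            synchronized (reconnectMutex) {
                // Ignore failures of transports that are not the connected one.
                if (connectedTransport == owner) {
                    connectedTransport = null;
                }
            }
        };
    }
}

final class FailoverDemo {
    public static void main(String[] args) {
        FakeTransport uri1 = new FakeTransport("URI1");
        FakeTransport uri2 = new FakeTransport("URI2");
        IOException late = new IOException("URI1 unreachable");

        // Shared listener: the late failure of URI1 arrives after URI2 connected.
        FailoverModel before = new FailoverModel();
        before.setConnected(uri2);
        before.sharedListener().onException(late);
        System.out.println("shared listener keeps URI2: " + (before.getConnected() == uri2));

        // Per-transport listener: the same late failure is ignored.
        FailoverModel after = new FailoverModel();
        after.setConnected(uri2);
        after.listenerFor(uri1).onException(late);
        System.out.println("per-transport listener keeps URI2: " + (after.getConnected() == uri2));
    }
}
```

In this model the shared listener loses the connection to URI2 while the per-transport listener keeps it. A failure reported by the listener of the connected transport still triggers the teardown, so normal failover is kept.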


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
