Petr Janata created AMQ-6248:
--------------------------------

             Summary: Failover - transport connected to one broker fails due to 
error in connection to another broker
                 Key: AMQ-6248
                 URL: https://issues.apache.org/jira/browse/AMQ-6248
             Project: ActiveMQ
          Issue Type: Bug
          Components: Transport
            Reporter: Petr Janata


There is a bug in {{FailoverTransport}} that is triggered by a race 
condition. The client log contains the message:
{{WARN | ActiveMQ Transport: *URI1* \[FailoverTransport] Transport (*URI2*) 
failed, attempting to automatically reconnect}}

The exact impact on client failover differs with each setup and environment. In 
our case it forced the client to switch endlessly between the two available 
brokers.

Assume the client is configured with a broker URL of the form
{{failover:(URI1,URI2)?randomize=false}}.
Assume that the broker at URI1 is down and the other broker at URI2 is running 
fine. This is a normal master/slave setup.

The client tries to establish a connection and the following happens:
1. URI1 is tried; it fails because this broker is not reachable (down or a 
waiting slave).
2. URI2 is tried; it succeeds because this broker is currently the 'master'.
3. An exception from the thread of the transport to URI1 causes a failure in 
the transport to URI2.
4. The next transport in the list is tried. Oh wait, it's URI1 -> go to 1.
The impact for other configurations might not be that severe. Unfortunately, 
in our case we were not able to avoid this bug no matter the configuration. For 
example, {{randomize=true}} helped a little, but the infinite loop still 
happened half of the time.

The bug is caused by a single shared instance, {{myTransportListener}}, of 
{{TransportListener}} in the {{FailoverTransport}} class. {{doReconnect()}} 
tries to start the transport to URI1 and registers the listener on it. The 
transport fails to start and the next transport, to URI2, is tried. But the 
listener is not unregistered from the failed transport to URI1. Failures that 
happen on transport URI1 may call the listener method {{onException()}} in 
that transport's own thread. This call reaches {{handleTransportFailure()}}, 
where it waits for the {{reconnectMutex}}. The reconnect task thread continues, 
establishes transport URI2, sets {{connectedTransport}}=URI2, and releases the 
{{reconnectMutex}}. The thread of transport URI1 then unblocks in 
{{handleTransportFailure()}} and destroys the {{connectedTransport}}=URI2.
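
The interleaving above can be replayed in a small model. This is only a sketch, 
not ActiveMQ code: {{FakeTransport}}, {{SharedListenerSketch}} and {{replay()}} 
are hypothetical names, the listener is reduced to a {{Runnable}}, and the 
thread interleaving is replaced by a fixed call order so the outcome is 
deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, single-threaded model of the shared-listener problem. Not ActiveMQ code. */
public class SharedListenerSketch {

    /** Hypothetical stand-in for a transport: a URI plus a failure callback. */
    static final class FakeTransport {
        final String uri;
        Runnable onException; // invoked when this transport reports a failure
        FakeTransport(String uri) { this.uri = uri; }
    }

    private FakeTransport connectedTransport;
    public final List<String> log = new ArrayList<>();

    // One listener instance shared by every transport, as described in the report.
    private final Runnable sharedListener = this::handleTransportFailure;

    /** Drops whatever is currently connected, regardless of which transport failed. */
    private void handleTransportFailure() {
        if (connectedTransport != null) {
            log.add("dropped " + connectedTransport.uri);
            connectedTransport = null;
        }
    }

    /** Replays the reported sequence in a fixed order and returns the final connection. */
    public String replay() {
        FakeTransport uri1 = new FakeTransport("URI1");
        uri1.onException = sharedListener; // registered, never unregistered
        // URI1 fails to start; the reconnect task moves on to URI2.
        FakeTransport uri2 = new FakeTransport("URI2");
        uri2.onException = sharedListener;
        connectedTransport = uri2;         // URI2 is now the live connection
        // The late failure from URI1 arrives after the reconnect has finished.
        uri1.onException.run();
        return connectedTransport == null ? "none" : connectedTransport.uri;
    }

    public static void main(String[] args) {
        SharedListenerSketch sketch = new SharedListenerSketch();
        System.out.println("connected after replay: " + sketch.replay());
        System.out.println(sketch.log);
    }
}
```

In this model the failure reported by URI1 tears down the healthy connection to 
URI2, because the shared listener cannot tell which transport called it.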

I have created a patch against version 5.11 that deals specifically with this 
problem.
The change is that, instead of the single shared {{myTransportListener}} 
instance, a new listener is created for each new transport.
Each new listener keeps a reference to the transport it was assigned to. The 
listener triggers failover only if the exception comes from the transport 
that is currently connected.
I did not touch the other methods of the listener, but these probably need 
the same restriction.
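
The idea of the fix can be illustrated roughly as follows. This is a 
simplified sketch and not the attached patch: {{FakeTransport}}, 
{{listenerFor()}} and {{replay()}} are hypothetical names, and the real 
{{TransportListener}} has more methods than the single failure callback 
modelled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of one listener per transport. Not the actual patch. */
public class PerTransportListenerSketch {

    /** Hypothetical stand-in for a transport: a URI plus a failure callback. */
    static final class FakeTransport {
        final String uri;
        Runnable onException; // invoked when this transport reports a failure
        FakeTransport(String uri) { this.uri = uri; }
    }

    private final Object reconnectMutex = new Object();
    private FakeTransport connectedTransport;
    public final List<String> log = new ArrayList<>();

    /** Creates a listener that remembers the one transport it belongs to. */
    private Runnable listenerFor(FakeTransport owner) {
        return () -> handleTransportFailure(owner);
    }

    /** Fails over only when the failing transport is the connected one. */
    private void handleTransportFailure(FakeTransport failed) {
        synchronized (reconnectMutex) {
            if (failed != connectedTransport) {
                log.add("ignored failure from " + failed.uri);
                return;
            }
            log.add("failing over from " + failed.uri);
            connectedTransport = null;
        }
    }

    /** Replays a stale failure from URI1, then a genuine failure from URI2. */
    public String replay() {
        FakeTransport uri1 = new FakeTransport("URI1");
        uri1.onException = listenerFor(uri1);
        FakeTransport uri2 = new FakeTransport("URI2");
        uri2.onException = listenerFor(uri2);
        synchronized (reconnectMutex) {
            connectedTransport = uri2; // URI2 is the live connection
        }
        uri1.onException.run();        // stale failure: must not affect URI2
        String afterStale = connectedTransport.uri;
        uri2.onException.run();        // genuine failure: triggers failover
        String afterReal = connectedTransport == null ? "none" : connectedTransport.uri;
        return afterStale + "," + afterReal;
    }

    public static void main(String[] args) {
        PerTransportListenerSketch sketch = new PerTransportListenerSketch();
        System.out.println(sketch.replay());
        System.out.println(sketch.log);
    }
}
```

The identity check is done while holding the mutex, so a failure from a 
transport that lost the race is compared against the transport that actually 
ended up connected.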

This bug is present in all versions from version 4.0 onward (I didn't check 
earlier ones). The idea in the patch should be applicable to all versions.

Btw, the log message mentioned in AMQ-4986 shows the same URI1 vs URI2 problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
