Petr Janata created AMQ-6248:
--------------------------------
Summary: Failover - transport connected to one broker fails due to
error in connection to another broker
Key: AMQ-6248
URL: https://issues.apache.org/jira/browse/AMQ-6248
Project: ActiveMQ
Issue Type: Bug
Components: Transport
Reporter: Petr Janata
There is a bug in {{FailoverTransport}} that is triggered by a race
condition. The client log contains the message:
{{WARN | ActiveMQ Transport: *URI1* \[FailoverTransport] Transport (*URI2*)
failed, attempting to automatically reconnect}}
The exact impact on client failover differs with each setup and environment. In
our case this forced the client to switch endlessly between the two available
brokers.
Assume the client is configured to use a broker URL of the form
{{failover:(URI1,URI2)?randomize=false}}.
Assume that the broker at URI1 is down and the other broker at URI2 is running
fine. This is a normal master/slave setup.
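For illustration, a minimal sketch of how such a URL could be assembled. The helper {{FailoverUrlSketch}} and the host names are hypothetical placeholders for URI1 and URI2; the resulting string is what would be handed to the client as its broker URL.

```java
// Hypothetical helper, not part of ActiveMQ: assembles a failover broker URL.
class FailoverUrlSketch {
    static String build(boolean randomize, String... brokerUris) {
        // Broker URIs are comma-separated inside the parentheses.
        return "failover:(" + String.join(",", brokerUris) + ")?randomize=" + randomize;
    }

    public static void main(String[] args) {
        // Placeholder host names standing in for URI1 and URI2.
        System.out.println(build(false, "tcp://broker1:61616", "tcp://broker2:61616"));
    }
}
```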
The client tries to establish a connection and the following happens:
1. URI1 is tried; it fails because this broker is not reachable (down or
waiting slave)
2. URI2 is tried; it succeeds because this broker is currently the 'master'
3. An exception from the thread of the transport to URI1 causes a failure in
the transport to URI2
4. The next transport in the list is tried. Oh wait, it's URI1 -> go to 1.
The impact for other configurations might not be that severe. Unfortunately, in
our case we were not able to avoid this bug no matter the configuration. For
example, {{randomize=true}} helped a little, but the infinite loop still
happens half of the time.
The bug is caused by a single shared instance, {{myTransportListener}}, of
{{TransportListener}} in the {{FailoverTransport}} class. {{doReconnect()}}
tries to start the transport to URI1 and registers the listener on it. The
transport fails to start and the next transport, to URI2, is tried. But the
listener is not unregistered from the failed transport to URI1. Failures that
happen on transport URI1 may then call the listener method {{onException()}} in
that transport's own thread. This call reaches {{handleTransportFailure()}},
where it waits for the {{reconnectMutex}}. Meanwhile the reconnect task thread
continues, establishes the transport to URI2, sets {{connectedTransport}}=URI2,
and releases the {{reconnectMutex}}. The thread of transport URI1 then unblocks
in {{handleTransportFailure()}} and destroys the {{connectedTransport}}=URI2.
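The sequence above can be reproduced in a small, single-threaded model. This is only a sketch: {{FakeTransport}}, {{Failover}} and {{Listener}} are simplified stand-ins invented for illustration, not the real ActiveMQ classes, and the late failure is triggered by hand instead of by a second thread.

```java
import java.io.IOException;

// Simplified stand-ins for illustration; these are not the real ActiveMQ classes.
class SharedListenerSketch {
    interface Listener { void onException(IOException error); }

    static class FakeTransport {
        final String uri;
        Listener listener;
        FakeTransport(String uri) { this.uri = uri; }

        // Simulates the transport's own thread reporting a failure.
        void reportFailure() {
            if (listener != null) listener.onException(new IOException(uri + " failed"));
        }
    }

    static class Failover {
        private final Object reconnectMutex = new Object();
        FakeTransport connectedTransport;

        // One listener shared by every transport: it cannot tell which one failed.
        private final Listener sharedListener = error -> handleTransportFailure();

        void handleTransportFailure() {
            synchronized (reconnectMutex) {
                connectedTransport = null; // tears down whatever is connected right now
            }
        }

        // Mirrors the reconnect sequence for URI1 (down) and URI2 (up).
        void reconnect(FakeTransport uri1, FakeTransport uri2) {
            synchronized (reconnectMutex) {
                uri1.listener = sharedListener; // registered, then the start fails
                // Bug: the listener is never removed from uri1.
                uri2.listener = sharedListener;
                connectedTransport = uri2;
            }
        }
    }

    // Returns true if the late failure from URI1 destroyed the healthy URI2 connection.
    static boolean lateFailureKillsConnection() {
        FakeTransport uri1 = new FakeTransport("URI1");
        FakeTransport uri2 = new FakeTransport("URI2");
        Failover failover = new Failover();
        failover.reconnect(uri1, uri2);
        uri1.reportFailure(); // late exception from the already abandoned transport
        return failover.connectedTransport == null;
    }

    public static void main(String[] args) {
        System.out.println("URI2 connection lost: " + lateFailureKillsConnection());
    }
}
```

In this model the failure reported by URI1 clears the connection to URI2, because the shared listener has no way to know which transport raised the exception.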
I have created a patch against version 5.11 that deals specifically with this
problem.
The change is that instead of the single shared {{myTransportListener}}
instance, a new listener is created for each new transport.
Each new listener keeps a reference to the transport it was assigned to. The
listener will cause failover only if the exception comes from the transport
which is currently connected.
I did not address the other methods of the listener, but these probably need
the same restriction.
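The idea can be sketched with a simplified model. This is not the attached patch: {{FakeTransport}}, {{Failover}} and {{listenerFor()}} are hypothetical names chosen for illustration, and only {{onException()}} is covered.

```java
import java.io.IOException;

// Simplified stand-ins for illustration; these are not the real ActiveMQ classes.
class PerTransportListenerSketch {
    interface Listener { void onException(IOException error); }

    static class FakeTransport {
        final String uri;
        Listener listener;
        FakeTransport(String uri) { this.uri = uri; }

        // Simulates the transport's own thread reporting a failure.
        void reportFailure() {
            if (listener != null) listener.onException(new IOException(uri + " failed"));
        }
    }

    static class Failover {
        private final Object reconnectMutex = new Object();
        FakeTransport connectedTransport;

        // A fresh listener per transport that remembers which transport it belongs to.
        Listener listenerFor(FakeTransport owner) {
            return error -> handleTransportFailure(owner);
        }

        void handleTransportFailure(FakeTransport failed) {
            synchronized (reconnectMutex) {
                if (failed != connectedTransport) {
                    return; // stale failure from a transport that is not connected
                }
                connectedTransport = null; // only now is failover triggered
            }
        }

        // URI1 fails to start, URI2 becomes the connected transport.
        void reconnect(FakeTransport uri1, FakeTransport uri2) {
            synchronized (reconnectMutex) {
                uri1.listener = listenerFor(uri1);
                uri2.listener = listenerFor(uri2);
                connectedTransport = uri2;
            }
        }
    }

    // A late failure from URI1 must leave the URI2 connection untouched.
    static boolean staleFailureIgnored() {
        FakeTransport uri1 = new FakeTransport("URI1");
        FakeTransport uri2 = new FakeTransport("URI2");
        Failover failover = new Failover();
        failover.reconnect(uri1, uri2);
        uri1.reportFailure();
        return failover.connectedTransport == uri2;
    }

    // A failure from the connected transport must still trigger failover.
    static boolean liveFailureTriggersFailover() {
        FakeTransport uri1 = new FakeTransport("URI1");
        FakeTransport uri2 = new FakeTransport("URI2");
        Failover failover = new Failover();
        failover.reconnect(uri1, uri2);
        uri2.reportFailure();
        return failover.connectedTransport == null;
    }

    public static void main(String[] args) {
        System.out.println("stale failure ignored: " + staleFailureIgnored());
        System.out.println("live failure handled: " + liveFailureTriggersFailover());
    }
}
```

The identity check is done while holding the mutex, so a failure that was queued behind the reconnect task is compared against the transport that is connected at the time it is handled.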
This bug is present in all versions from version 4.0 (I didn't go deeper). The
idea in the patch should be applicable for all versions.
Btw, the log message mentioned in AMQ-4986 shows the same URI1 vs URI2 problem.