[jira] Resolved: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

Gary Tully (JIRA) Mon, 31 Aug 2009 02:07:49 -0700

     [ 
https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gary Tully resolved AMQ-1855.
-----------------------------

    Resolution: Fixed

though I could not reproduce, the changes in revision 
http://svn.apache.org/viewcvs?view=rev&rev=808890 will fix this issue for both 
the simple discovery and multicast discovery providers.

> bridge reconnection stops because of race in SimpleDiscoveryAgent
> -----------------------------------------------------------------
>
>                 Key: AMQ-1855
>                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Connector
>    Affects Versions: 4.1.2
>            Reporter: Mario Lukica
>            Assignee: Gary Tully
>             Fix For: 5.3.0
>
>
> I believe there is a race condition in SimpleDiscoveryAgent which can cause 
> subsequent bridge restart to fail, without starting new thread that should 
> restart a bridge. As a consequence, network bridge is never restarted.
> Following scenario leads to this:
> 1. bridge is disconnected (e.g. local error: 
> org.apache.activemq.transport.InactivityIOException: Channel was inactive for 
> too long)
> 2. bridge is disposed in separate thread in 
> DemandForwardingBridge.serviceLocalException
> 3. SimpleDiscoveryAgent.serviceFailed is called which starts up another 
> thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to 
> restart bridge
> 4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: 
> some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected 
> (this one is caused by race condition with thread disposing the bridge, since 
> given client subscription should be removed by thread disposing the bridge 
> (step 2)
> 5. this causes invocation of DemandForwardingBridge.serviceLocalException 
> (this call can be made asynchronously, while previous bridge startup is still 
> in progress)
> As a consequence, multiple threads can end up calling 
> SimpleDiscoveryAgent.serviceFailed simultaneously.
> serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try 
> to reconnect bridge. Reconnect logic is guarded by 
> if( event.failed.compareAndSet(false, true) ) 
> which tries to ensure that only a single thread is reconnecting bridge at 
> some point.
> {code}
>     public void serviceFailed(DiscoveryEvent devent) throws IOException {
>       
>         final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
>         if( event.failed.compareAndSet(false, true) ) {
>               
>                       listener.onServiceRemove(event);
>               Thread thread = new Thread() {
>                       public void run() {
>       
>       
>                               // We detect a failed connection attempt 
> because the service fails right
>                               // away.
>                               if( event.connectTime + minConnectTime > 
> System.currentTimeMillis()  ) {
>                                       
>                                       event.connectFailures++;
>                                       
>                                       if( maxReconnectAttempts>0 &&  
> event.connectFailures >= maxReconnectAttempts ) {
>                                               // Don' try to re-connect
>                                               return;
>                                       }
>                                       
>                               synchronized(sleepMutex){
>                                   try{
>                                       if( !running.get() )
>                                               return;
>                                       
>                                       sleepMutex.wait(event.reconnectDelay);
>                                   }catch(InterruptedException ie){
>                                 Thread.currentThread().interrupt();
>                                      return;
>                                   }
>                               }
>       
>                               if (!useExponentialBackOff) {
>                                   event.reconnectDelay = 
> initialReconnectDelay;
>                               } else {
>                                   // Exponential increment of reconnect delay.
>                                   event.reconnectDelay*=backOffMultiplier;
>                                   if(event.reconnectDelay>maxReconnectDelay)
>                                       event.reconnectDelay=maxReconnectDelay;
>                               }
>                               
>                               } else {
>                                       event.connectFailures = 0;
>                           event.reconnectDelay = initialReconnectDelay;
>                               }
>                                                                       
>                       if( !running.get() )
>                               return;
>                       
>                               event.connectTime = System.currentTimeMillis();
>                               event.failed.set(false);
>                               
>                               listener.onServiceAdd(event);
>                       }
>               };
>               thread.setDaemon(true);
>               thread.start();
>         }
>     }
> {code}
> Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set 
> to false (T1), and it's possible for some other thread (T2) to enter block 
> guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect 
> process has already begun by first thread. T2 can satisfy condition: if( 
> event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will 
> enter 
> sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true 
> (causing all other calls to serviceFailed not to start thread that will 
> reconnect bridge).
> If first thread (T1) fails to reconnect bridge (e.g because of 
> InvalidClientIDException described in step 4), it will not schedule new 
> thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, 
> and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == 
> true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it 
> will try to restart broker and fail because of following condition in 
> DiscoveryNetworkConnector:
> {code}
>             if (    bridges.containsKey(uri) 
>                     || localURI.equals(uri) 
>                     || (connectionFilter!=null && 
> !connectionFilter.connectTo(uri))
>                     )
>                 return;
> {code}
> bridges.containsKey(uri) will be true (thread T1 added it while 
> unsuccessfully trying to reconnect bridge), and T2 will return from 
> DiscoveryNetworkConnector.onServiceAdd and will not start bridge. 
> No additional attempt to reconnect bridge will be made, since T2 held 
> event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed 
> calls from other threads processing local or remote bridge exceptions.
> End result:
> - DiscoveryNetworkConnector.bridges contains bridge that is disposed and 
> prevents all other attempts to restart bridge (onServiceAdd always returns 
> because bridges.containsKey(uri) == true) 
> - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last 
> attempt which returned without restarting the bridge - 
> SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not 
> started
> I think that synchronization of threads processing bridge exceptions and 
> entering SimpleDiscoveryAgent.serviceFailed should be verified and/or 
> improved.
> Also, InvalidClientIDException is relatively common (at least on multicore 
> machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException 
> (which starts another thread doing 
> ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be 
> changed to wait a bit for bridge disposal to finish (e.g. sleep for some 
> time) and then try to restart a bridge - waiting for a second more to restart 
> a bridge is better then not to start it at all
> I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 
> and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and 
> DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using 
> ASYNC_TASKS to execute asynchronous calls, instead of starting new threads 
> directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent

Reply via email to