[ https://issues.apache.org/activemq/browse/AMQ-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=53828#action_53828 ]
Gary Tully commented on AMQ-1855: --------------------------------- the enhancements in http://svn.apache.org/viewvc?rev=805361&view=rev can help configure a faster timeout on a discovered network connector. > bridge reconnection stops because of race in SimpleDiscoveryAgent > ----------------------------------------------------------------- > > Key: AMQ-1855 > URL: https://issues.apache.org/activemq/browse/AMQ-1855 > Project: ActiveMQ > Issue Type: Bug > Components: Connector > Affects Versions: 4.1.2 > Reporter: Mario Lukica > Assignee: Gary Tully > Fix For: 5.3.0 > > > I believe there is a race condition in SimpleDiscoveryAgent which can cause > subsequent bridge restart to fail, without starting new thread that should > restart a bridge. As a consequence, network bridge is never restarted. > Following scenario leads to this: > 1. bridge is disconnected (e.g. local error: > org.apache.activemq.transport.InactivityIOException: Channel was inactive for > too long) > 2. bridge is disposed in separate thread in > DemandForwardingBridge.serviceLocalException > 3. SimpleDiscoveryAgent.serviceFailed is called which starts up another > thread which calls DiscoveryNetworkConnector.onServiceAdd which tries to > restart bridge > 4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: > some_broker2 - Client: NC_some_broker1_inboundlocalhost already connected > (this one is caused by race condition with thread disposing the bridge, since > given client subscription should be removed by thread disposing the bridge > (step 2) > 5. this causes invocation of DemandForwardingBridge.serviceLocalException > (this call can be made asynchronously, while previous bridge startup is still > in progress) > As a consequence, multiple threads can end up calling > SimpleDiscoveryAgent.serviceFailed simultaneously. > serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try > to reconnect bridge. Reconnect logic is guarded by > if( event.failed.compareAndSet(false, true) ) > which tries to ensure that only a single thread is reconnecting bridge at > some point. > {code} > public void serviceFailed(DiscoveryEvent devent) throws IOException { > > final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent; > if( event.failed.compareAndSet(false, true) ) { > > listener.onServiceRemove(event); > Thread thread = new Thread() { > public void run() { > > > // We detect a failed connection attempt > because the service fails right > // away. > if( event.connectTime + minConnectTime > > System.currentTimeMillis() ) { > > event.connectFailures++; > > if( maxReconnectAttempts>0 && > event.connectFailures >= maxReconnectAttempts ) { > // Don' try to re-connect > return; > } > > synchronized(sleepMutex){ > try{ > if( !running.get() ) > return; > > sleepMutex.wait(event.reconnectDelay); > }catch(InterruptedException ie){ > Thread.currentThread().interrupt(); > return; > } > } > > if (!useExponentialBackOff) { > event.reconnectDelay = > initialReconnectDelay; > } else { > // Exponential increment of reconnect delay. > event.reconnectDelay*=backOffMultiplier; > if(event.reconnectDelay>maxReconnectDelay) > event.reconnectDelay=maxReconnectDelay; > } > > } else { > event.connectFailures = 0; > event.reconnectDelay = initialReconnectDelay; > } > > if( !running.get() ) > return; > > event.connectTime = System.currentTimeMillis(); > event.failed.set(false); > > listener.onServiceAdd(event); > } > }; > thread.setDaemon(true); > thread.start(); > } > } > {code} > Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set > to false (T1), and it's possible for some other thread (T2) to enter block > guarded by if( event.failed.compareAndSet(false, true) ) , while reconnect > process has already begun by first thread. T2 can satisfy condition: if( > event.connectTime + minConnectTime > System.currentTimeMillis() ) and will > enter > sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true > (causing all other calls to serviceFailed not to start thread that will > reconnect bridge). > If first thread (T1) fails to reconnect bridge (e.g because of > InvalidClientIDException described in step 4), it will not schedule new > thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove, > and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == > true, and T2 still waiting (default 5 sec). When T2 wakes up from wait, it > will try to restart broker and fail because of following condition in > DiscoveryNetworkConnector: > {code} > if ( bridges.containsKey(uri) > || localURI.equals(uri) > || (connectionFilter!=null && > !connectionFilter.connectTo(uri)) > ) > return; > {code} > bridges.containsKey(uri) will be true (thread T1 added it while > unsuccessfully trying to reconnect bridge), and T2 will return from > DiscoveryNetworkConnector.onServiceAdd and will not start bridge. > No additional attempt to reconnect bridge will be made, since T2 held > event.failed == true, effectively ignoring SimpleDiscoveryAgent.serviceFailed > calls from other threads processing local or remote bridge exceptions. > End result: > - DiscoveryNetworkConnector.bridges contains bridge that is disposed and > prevents all other attempts to restart bridge (onServiceAdd always returns > because bridges.containsKey(uri) == true) > - SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last > attempt which returned without restarting the bridge - > SimpleDiscoveryAgent.serviceFailed is not called again, since bridge is not > started > I think that synchronization of threads processing bridge exceptions and > entering SimpleDiscoveryAgent.serviceFailed should be verified and/or > improved. > Also, InvalidClientIDException is relatively common (at least on multicore > machines, e.g. Solaris T2000), maybe ConduitBridge.serviceLocalException > (which starts another thread doing > ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be > changed to wait a bit for bridge disposal to finish (e.g. sleep for some > time) and then try to restart a bridge - waiting for a second more to restart > a bridge is better then not to start it at all > I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 > and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed and > DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using > ASYNC_TASKS to execute asynchronous calls, instead of starting new threads > directly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.