[ 
https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493537#comment-13493537
 ] 

Timothy Bish commented on AMQ-4160:
-----------------------------------

Once you provide the updated patch we'll get this into a SNAPSHOT.
                
> DiscoveryNetworkConnector can lose track of active bridges, resulting in 
> permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race 
> conditions allow the {{bridges}} data structure to become corrupt and not 
> accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result 
> in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is 
> not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is 
> active; as a result, the bridge creation will fail (because of duplicate 
> connections), and be indefinitely retried, generating many WARN/ERROR log 
> messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge 
> is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate 
> onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, 
> event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad 
> URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a 
> result, concurrent attempts to add a bridge to the same URL can be allowed to 
> proceed to bridge creation.  Only one of the bridges will be established (the 
> other will fail when it attempts to add duplicate connections); however, the 
> failure of the second bridge will result in a call to 
> {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since 
> bridges are multi-threaded, it is possible that an exception will be handled 
> by a different thread at some time between {{bridge.start()}} and 
> {{bridges.put(uri, bridge}}.  In processing this exception, the thread will 
> call {{onServiceRemove(...}}, which will remove the (non-existent) 
> {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, 
> and adds the entry to {{bridges}} even though it is now stale and represents 
> a failed bridge.  Subsequent attempts to restore the bridge will be ignored, 
> and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by 
> AMQ-4159, which can result in many concurrent attempts to establish a bridge 
> to the same URL.  AMQ-4159 also described how multiple calls can be made to 
> {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the 
> second case described above (i.e., unexpected removal of a bridge that is 
> active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to 
> {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding 
> the entry to {{bridges}} prior to starting the bridge.  An additional 
> structure, {{activeEvents}}, keeps track of the event that represents the 
> current attempt to establish a bridge to a given remote URL --- this is used 
> to prevent multiple calls to {{onServiceRemove(...)}} from removing a 
> {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in 
> {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be 
> considered.  In particular, there is a tight-coupling between 
> {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not 
> evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} 
> assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in 
> a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call 
> to {{onServiceRemove(...)}} is necessary to clean up the resources used by 
> the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to