[ 
https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stirling Chow updated AMQ-4160:
-------------------------------

    Description: 
Symptom
=======
{{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race 
conditions allow the {{bridges}} data structure to become corrupt and not 
accurately represent the bridges that exist.

Because {{bridges}} is used to control whether a discovery event will result in 
a bridge creation attempt, if it is not accurate, two results are possible:

# A discovery event will be ignored even though its corresponding bridge is not 
active; as a result, the bridge will never be established
# A discovery event will be processed even though its corresponding bridge is 
active; as a result, the bridge creation will fail (because of duplicate 
connections), and be indefinitely retried, generating many WARN/ERROR log 
messages

Cause
=====
{{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge 
is created or removed:

{code:title=DiscoveryNetworkConnector.java}
public void onServiceAdd(DiscoveryEvent event) {
...
        // Should we try to connect to that URI?
        synchronized (bridges) {
            if( bridges.containsKey(uri) ) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Discovery agent generated a duplicate 
onServiceAdd event for: "+uri );
                }
                return;
            }
        }
...
        NetworkBridge bridge = createBridge(localTransport, remoteTransport, 
event);
        try {
            bridge.start();
            synchronized (bridges) {
                bridges.put(uri, bridge);
            }
...
}


public void onServiceRemove(DiscoveryEvent event) {
    String url = event.getServiceName();
    if (url != null) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            LOG.warn("Could not connect to remote URI: " + url + " due to bad 
URI syntax: " + e, e);
            return;
        }

        synchronized (bridges) {
            bridges.remove(uri);
        }
    }
}
{code}

There are a number of problems:

# The check-and-set for adding an entry {{bridges}} is not atomic.  As a 
result, concurrent attempts to add a bridge to the same URL can be allowed to 
proceed to bridge creation.  Only one of the bridges will be established (the 
other will fail when it attempts to add duplicate connections); however, the 
failure of the second bridge will result in a call to {{onServiceRemove(...)}} 
that will remove the single/shared {{bridges}} entry.
# The bridge is started before an entry is added to {{bridges}}.  Since bridges 
are multi-threaded, it is possible that an exception will be handled by a 
different thread at some time between {{bridge.start()}} and {{bridges.put(uri, 
bridge}}.  In processing this exception, the thread will call 
{{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} 
entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the 
entry to {{bridges}} even though it is now stale and represents a failed 
bridge.  Subsequent attempts to restore the bridge will be ignored, and no 
active bridge will be created.

The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by 
AMQ-4159, which can result in many concurrent attempts to establish a bridge to 
the same URL.  AMQ-4159 also described how multiple calls can be made to 
{{onServiceRemove(...)}}, which can result in mal-behaviour similar to the 
second case described above (i.e., unexpected removal of a bridge that is 
active).

Solution
========
The attached patch attempts to restore some thread-safety to 
{{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the 
entry to {{bridges}} prior to starting the bridge.  An additional structure, 
{{activeEvents}}, keeps track of the event that represents the current attempt 
to establish a bridge to a given remote URL --- this is used to prevent 
multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry 
that was *not* added by the corresponding discovery event.  

This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, 
and a refactoring of the connector should be considered.  In particular, there 
is a tight-coupling between {{DiscoveryNetworkConnector}} and 
{{SimpleDiscoveryAgent}} that is not evident through their interfaces.  For 
example, {{DiscoveryNetworkConnector}} assumes that the call to 
{{discoveryAgent.serviceFailed(...)}} will result in a call back to 
{{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call to 
{{onServiceRemove(...)}} is necessary to clean up the resources used by the 
bridge, but that requirement is not explicit, and therefore easily missed.

  was:
Symptom
=======
{{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race 
conditions allow the {{bridges}} data structure to become corrupt and not 
accurately represent the bridges that exist.

Because {{bridges}} is used to control whether a discovery event will result in 
a bridge creation attempt, if it is not accurate, two results are possible:

# A discovery event will be ignored even though its corresponding bridge is not 
active; as a result, the bridge will never be established
# A discovery event will be processed even though its corresponding bridge is 
active; as a result, the bridge creation will fail (because of duplicate 
connections), and be indefinitely retried, generating many WARN/ERROR log 
messages

Cause
=====
{{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge 
is created or removed:

{code:title=DiscoveryNetworkConnector.java}
public void onServiceAdd(DiscoveryEvent event) {
...
        // Should we try to connect to that URI?
        synchronized (bridges) {
            if( bridges.containsKey(uri) ) {
                if (LOG.isDebugEnabled()) {
                    LOG.debug("Discovery agent generated a duplicate 
onServiceAdd event for: "+uri );
                }
                return;
            }
        }
...
        NetworkBridge bridge = createBridge(localTransport, remoteTransport, 
event);
        try {
            bridge.start();
            synchronized (bridges) {
                bridges.put(uri, bridge);
            }
...
}


public void onServiceRemove(DiscoveryEvent event) {
    String url = event.getServiceName();
    if (url != null) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            LOG.warn("Could not connect to remote URI: " + url + " due to bad 
URI syntax: " + e, e);
            return;
        }

        synchronized (bridges) {
            bridges.remove(uri);
        }
    }
}
{code}

There are a number of problems:

# The check-and-set for adding an entry {{bridges}} is not atomic.  As a 
result, concurrent attempts to add a bridge to the same URL can be allowed to 
proceed to bridge creation.  Only one of the bridges will be established (the 
other will fail when it attempts to add duplicate connections); however, the 
failure of the second bridge will result in a call to {{onServiceRemove(...)}} 
that will remove the single/shared {{bridges}} entry.
# The bridge is started before an entry is added to {{bridges}}.  Since bridges 
are multi-threaded, it is possible that an exception will be handled by a 
different thread at some time between {{bridge.start()}} and {{bridges.put(uri, 
bridge}}.  In processing this exception, the thread will call 
{{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} 
entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the 
entry to {{bridges}} even though it is now stale and represents a failed 
bridge.  Subsequent attempts to restore the bridge will be ignored, and no 
active bridge will be created.

The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by 
AMQ-4159, which can result in many concurrent attempts to establish a bridge to 
the same URL.  AMQ-4159 also described how multiple calls can be made to 
{{onServiceRemove(...)}}, which can result in mal-behaviour similar to the 
second case described above (i.e., unexpected removal of a bridge that is 
active).

Solution
========
The attached patch attempts to restore some thread-safety to 
{{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding the 
entry to {{bridges}} prior to starting the bridge.  An additional structure, 
{{activeEvents}}, keeps track of the event that represents the current attempt 
to establish a bridge to a given remote URL --- this is used to prevent 
multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry 
that was *not* added by the corresponding discovery event.  

    
> DiscoveryNetworkConnector can loss track of active bridges, resulting in 
> permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMQ-4160
>                 URL: https://issues.apache.org/jira/browse/AMQ-4160
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Priority: Critical
>             Fix For: 5.8.0
>
>         Attachments: AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race 
> conditions allow the {{bridges}} data structure to become corrupt and not 
> accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result 
> in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is 
> not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is 
> active; as a result, the bridge creation will fail (because of duplicate 
> connections), and be indefinitely retried, generating many WARN/ERROR log 
> messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge 
> is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate 
> onServiceAdd event for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, 
> event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad 
> URI syntax: " + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a 
> result, concurrent attempts to add a bridge to the same URL can be allowed to 
> proceed to bridge creation.  Only one of the bridges will be established (the 
> other will fail when it attempts to add duplicate connections); however, the 
> failure of the second bridge will result in a call to 
> {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since 
> bridges are multi-threaded, it is possible that an exception will be handled 
> by a different thread at some time between {{bridge.start()}} and 
> {{bridges.put(uri, bridge}}.  In processing this exception, the thread will 
> call {{onServiceRemove(...}}, which will remove the (non-existent) 
> {{bridges}} entry.  Subseqently, {{bridges.put(uri, bridge)}} is executed, 
> and adds the entry to {{bridges}} even though it is now stale and represents 
> a failed bridge.  Subsequent attempts to restore the bridge will be ignored, 
> and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by 
> AMQ-4159, which can result in many concurrent attempts to establish a bridge 
> to the same URL.  AMQ-4159 also described how multiple calls can be made to 
> {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the 
> second case described above (i.e., unexpected removal of a bridge that is 
> active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to 
> {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding 
> the entry to {{bridges}} prior to starting the bridge.  An additional 
> structure, {{activeEvents}}, keeps track of the event that represents the 
> current attempt to establish a bridge to a given remote URL --- this is used 
> to prevent multiple calls to {{onServiceRemove(...)}} from removing a 
> {{bridges}} entry that was *not* added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in 
> {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be 
> considered.  In particular, there is a tight-coupling between 
> {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not 
> evident through their interfaces.  For example, {{DiscoveryNetworkConnector}} 
> assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in 
> a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call 
> to {{onServiceRemove(...)}} is necessary to clean up the resources used by 
> the bridge, but that requirement is not explicit, and therefore easily missed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to