[
https://issues.apache.org/jira/browse/AMQ-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506927#comment-13506927
]
Stirling Chow commented on AMQ-4160:
------------------------------------
Great to hear, Timothy. Thanks for working through this with me.
> DiscoveryNetworkConnector can lose track of active bridges, resulting in
> permanent bridge failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: AMQ-4160
> URL: https://issues.apache.org/jira/browse/AMQ-4160
> Project: ActiveMQ
> Issue Type: Bug
> Reporter: Stirling Chow
> Assignee: Timothy Bish
> Priority: Critical
> Fix For: 5.8.0
>
> Attachments: AMQ4160_2.patch, AMQ4160.patch, AMQ4160Test.java
>
>
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race
> conditions allow the {{bridges}} data structure to become corrupt and not
> accurately represent the bridges that exist.
> Because {{bridges}} is used to control whether a discovery event will result
> in a bridge creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is
> not active; as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is
> active; as a result, the bridge creation will fail (because of duplicate
> connections), and be indefinitely retried, generating many WARN/ERROR log
> messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge
> is created or removed:
> {code:title=DiscoveryNetworkConnector.java}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
> // Should we try to connect to that URI?
> synchronized (bridges) {
> if( bridges.containsKey(uri) ) {
> if (LOG.isDebugEnabled()) {
> LOG.debug("Discovery agent generated a duplicate
> onServiceAdd event for: "+uri );
> }
> return;
> }
> }
> ...
> NetworkBridge bridge = createBridge(localTransport, remoteTransport,
> event);
> try {
> bridge.start();
> synchronized (bridges) {
> bridges.put(uri, bridge);
> }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
> String url = event.getServiceName();
> if (url != null) {
> URI uri;
> try {
> uri = new URI(url);
> } catch (URISyntaxException e) {
> LOG.warn("Could not connect to remote URI: " + url + " due to bad
> URI syntax: " + e, e);
> return;
> }
> synchronized (bridges) {
> bridges.remove(uri);
> }
> }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic. As a
> result, concurrent attempts to add a bridge to the same URL can be allowed to
> proceed to bridge creation. Only one of the bridges will be established (the
> other will fail when it attempts to add duplicate connections); however, the
> failure of the second bridge will result in a call to
> {{onServiceRemove(...)}} that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}. Since
> bridges are multi-threaded, it is possible that an exception will be handled
> by a different thread at some time between {{bridge.start()}} and
> {{bridges.put(uri, bridge}}. In processing this exception, the thread will
> call {{onServiceRemove(...}}, which will remove the (non-existent)
> {{bridges}} entry. Subseqently, {{bridges.put(uri, bridge)}} is executed,
> and adds the entry to {{bridges}} even though it is now stale and represents
> a failed bridge. Subsequent attempts to restore the bridge will be ignored,
> and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by
> AMQ-4159, which can result in many concurrent attempts to establish a bridge
> to the same URL. AMQ-4159 also described how multiple calls can be made to
> {{onServiceRemove(...)}}, which can result in mal-behaviour similar to the
> second case described above (i.e., unexpected removal of a bridge that is
> active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to
> {{DiscoveryNetworkConnector}} by making the check-and-set atomic and adding
> the entry to {{bridges}} prior to starting the bridge. An additional
> structure, {{activeEvents}}, keeps track of the event that represents the
> current attempt to establish a bridge to a given remote URL --- this is used
> to prevent multiple calls to {{onServiceRemove(...)}} from removing a
> {{bridges}} entry that was *not* added by the corresponding discovery event.
> This patch is a band aid to the design flaws in
> {{DiscoveryNetworkConnector}}, and a refactoring of the connector should be
> considered. In particular, there is a tight-coupling between
> {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not
> evident through their interfaces. For example, {{DiscoveryNetworkConnector}}
> assumes that the call to {{discoveryAgent.serviceFailed(...)}} will result in
> a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}. The call
> to {{onServiceRemove(...)}} is necessary to clean up the resources used by
> the bridge, but that requirement is not explicit, and therefore easily missed.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira