[
https://issues.apache.org/jira/browse/AMQ-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496722#comment-13496722
]
Stirling Chow commented on AMQ-2541:
------------------------------------
Hi Rob,
Sorry for the delay! I missed your comment, and we just ran into this issue
again when we upgraded to AMQ 5.6.0. I've attached an updated patch based on
the AMQ trunk. The patch uses the SimpleDiscoveryAgent's taskrunner to launch
concurrent bridge creation attempts at startup. The unit test verifies that
startup is concurrent. I also had to change EmbeddedBrokerTestSupport so that
it properly initialized the auto-fail thread.
> Extremely slow broker startup when using SimpleDiscoveryAgent with an
> inactive Network of Brokers.
> --------------------------------------------------------------------------------------------------
>
> Key: AMQ-2541
> URL: https://issues.apache.org/jira/browse/AMQ-2541
> Project: ActiveMQ
> Issue Type: Bug
> Components: Broker, Connector
> Affects Versions: 5.3.0
> Reporter: Stirling Chow
> Fix For: 5.x
>
> Attachments: AMQ2541.patch, AMQ2541Test.java
>
>
> Symptom
> ========
> An AMQ broker that is configured to join a statically-defined (i.e., using
> uri="static:(tcp://host1:61616,tcp://host2:61616,tcp://..) ) network of
> brokers can have an extremely long startup time (in the order of 5+ minutes),
> if many of the brokers in the network are not alive.
> The following log entires show the startup of an AMQ Broker
> (http://192.168.170.112:50000) that is configured to join a network with
> three other brokers:
> http://10.10.60.78:50000
> http://10.9.62.135:50000
> http://10.10.60.75:50000)
> The three other have not yet started.
> The log file shows that it takes nearly 4 minutes from BrokerService#start()
> to return control to the calling thread (AlarmPoint Node-main):
> 2009-12-18 15:24:46,783 [AlarmPoint Node-main] INFO - - ActiveMQ 5.3.0
> JMS Message Broker (localhost) is starting
> ...
> 2009-12-18 15:24:47,158 [AlarmPoint Node-main] INFO - - Connector
> http://192.168.170.112:50000 Started
> 2009-12-18 15:24:47,158 [AlarmPoint Node-main] INFO - - Establishing
> network connection from vm://localhost to http://10.10.60.78:50000
> ...
> 2009-12-18 15:26:11,314 [AlarmPoint Node-main] WARN - - Could not start
> network bridge between: vm://localhost and: http://10.10.60.78:50000 due to:
> java.net.ConnectException: Connection timed out: connect
> 2009-12-18 15:26:11,314 [AlarmPoint Node-main] DEBUG - - Start failure
> exception: java.net.ConnectException: Connection timed out: connect
> 2009-12-18 15:26:11,314 [AlarmPoint Node-main] INFO - - Establishing
> network connection from vm://localhost to http://10.9.62.135:50000
> ...
> 2009-12-18 15:27:35,299 [AlarmPoint Node-main] WARN - - Could not start
> network bridge between: vm://localhost and: http://10.9.62.135:50000 due to:
> java.net.ConnectException: Connection timed out: connect
> 2009-12-18 15:27:35,299 [AlarmPoint Node-main] DEBUG - - Start failure
> exception: java.net.ConnectException: Connection timed out: connect
> 2009-12-18 15:27:35,299 [AlarmPoint Node-main] INFO - - Establishing
> network connection from vm://localhost to http://10.10.60.75:50000
> ...
> 2009-12-18 15:28:59,314 [AlarmPoint Node-main] WARN - - Could not start
> network bridge between: vm://localhost and: http://10.10.60.75:50000 due to:
> java.net.ConnectException: Connection timed out: connect
> 2009-12-18 15:28:59,314 [AlarmPoint Node-main] DEBUG - - Start failure
> exception: java.net.ConnectException: Connection timed out: connect
> 2009-12-18 15:28:59,314 [AlarmPoint Node-main] INFO - - Network Connector
> bridge Started
> 2009-12-18 15:28:59,314 [AlarmPoint Node-main] INFO - - ActiveMQ JMS
> Message Broker (localhost, ID:vic-esx4-ns1-1280-1261178686846-0:0) started
> Cause
> =====
> The broker's network connector is implemented by
> org.apache.activemq.network.DiscoveryNetworkConnector, which in turn uses
> org.apache.activemq.transport.discovery.simple.SimpleDiscoveryAgent to
> determine whether the URLs configured in
> uri="static:(tcp://host1:61616,tcp://host2:61616,tcp://..) " are active.
> SimpleDiscoveryAgent#start() has this loop:
> public void start() throws Exception {
> running.set(true);
> for (int i = 0; i < services.length; i++) {
> listener.onServiceAdd(new SimpleDiscoveryEvent(services[i]));
> }
> }
> "listener.onServiceAdd(...) " is called for each URL and is implemented by
> DiscoveryNetworkConnector#onServiceAdd(...). The main thread calls
> BrokerService#start() which calls DiscoveryNetworkConnector#start() which
> calls SimpleDiscoveryAgent#start(), which sequentially calls
> DiscoveryNetworkConnector#onServiceAdd(...). Since the URLs being
> "discovered" are inactive, DiscoveryNetworkConnector#onServiceAdd(...)
> blocks ~1m30s (this will depend on network configuration) for each URL. This
> blocks the main thread that is trying to start the broker. If there are
> several inactive URLs, then the blocking time becomes excessive.
> Solution
> =======
> If you follow through the DiscoveryNetworkConnector#onServiceAdd(...) method,
> it eventially calls SimpleDiscoveryAgent#serviceFailed(...) for each inactive
> URL. In turn SimpleDiscoveryAgent#serviceFailed(...) launches an
> asynchronous task that pauses for the configured reconnect delay, and then
> retries the call to DiscoveryNetworkConnector#onServiceAdd(...). So it must
> be safe to call DiscoveryNetworkConnector#onServiceAdd(...) concurrently.
> Therefore, SimpleDiscoveryAgent#start()'s loop should be changed to launch
> asynchronous tasks to make the DiscoveryNetworkConnector#onServiceAdd(...)
> calls concurrently rather than synchronously.
> This solution has the benefit of returning control immediately to the caller
> of SimpleDiscoveryAgent#start(...), thus starting the broker faster, and
> allows the network discovery to find "active" URLs much faster (i.e., with
> the sequential loop, if the "active" URL is the last one, its discovery is
> significantly delayed).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira