Extremely slow broker startup when using SimpleDiscoveryAgent with an inactive 
Network of Brokers.
--------------------------------------------------------------------------------------------------

                 Key: AMQ-2541
                 URL: https://issues.apache.org/activemq/browse/AMQ-2541
             Project: ActiveMQ
          Issue Type: Bug
          Components: Broker, Connector
    Affects Versions: 5.3.0
            Reporter: Stirling Chow


Symptom
========
An AMQ broker that is configured to join a statically-defined (i.e., using 
uri="static:(tcp://host1:61616,tcp://host2:61616,tcp://..) ) network of brokers 
can have an extremely long startup time (in the order of 5+ minutes), if many 
of the brokers in the network are not alive.

The following log entires show the startup of an AMQ Broker 
(http://192.168.170.112:50000) that is configured to join a network with three 
other brokers:
http://10.10.60.78:50000
http://10.9.62.135:50000
http://10.10.60.75:50000)

The three other have not yet started.

The log file shows that it takes nearly 4 minutes from BrokerService#start() to 
return control to the calling thread (AlarmPoint Node-main):

2009-12-18 15:24:46,783 [AlarmPoint Node-main] INFO    -  - ActiveMQ 5.3.0 JMS 
Message Broker (localhost) is starting
...
2009-12-18 15:24:47,158 [AlarmPoint Node-main] INFO    -  - Connector 
http://192.168.170.112:50000 Started
2009-12-18 15:24:47,158 [AlarmPoint Node-main] INFO    -  - Establishing 
network connection from vm://localhost to http://10.10.60.78:50000
...
2009-12-18 15:26:11,314 [AlarmPoint Node-main] WARN    -  - Could not start 
network bridge between: vm://localhost and: http://10.10.60.78:50000 due to: 
java.net.ConnectException: Connection timed out: connect
2009-12-18 15:26:11,314 [AlarmPoint Node-main] DEBUG   -  - Start failure 
exception: java.net.ConnectException: Connection timed out: connect
2009-12-18 15:26:11,314 [AlarmPoint Node-main] INFO    -  - Establishing 
network connection from vm://localhost to http://10.9.62.135:50000
...
2009-12-18 15:27:35,299 [AlarmPoint Node-main] WARN    -  - Could not start 
network bridge between: vm://localhost and: http://10.9.62.135:50000 due to: 
java.net.ConnectException: Connection timed out: connect
2009-12-18 15:27:35,299 [AlarmPoint Node-main] DEBUG   -  - Start failure 
exception: java.net.ConnectException: Connection timed out: connect
2009-12-18 15:27:35,299 [AlarmPoint Node-main] INFO    -  - Establishing 
network connection from vm://localhost to http://10.10.60.75:50000
...
2009-12-18 15:28:59,314 [AlarmPoint Node-main] WARN    -  - Could not start 
network bridge between: vm://localhost and: http://10.10.60.75:50000 due to: 
java.net.ConnectException: Connection timed out: connect
2009-12-18 15:28:59,314 [AlarmPoint Node-main] DEBUG   -  - Start failure 
exception: java.net.ConnectException: Connection timed out: connect
2009-12-18 15:28:59,314 [AlarmPoint Node-main] INFO    -  - Network Connector 
bridge Started
2009-12-18 15:28:59,314 [AlarmPoint Node-main] INFO    -  - ActiveMQ JMS 
Message Broker (localhost, ID:vic-esx4-ns1-1280-1261178686846-0:0) started

Cause
=====
The broker's network connector is implemented by 
org.apache.activemq.network.DiscoveryNetworkConnector, which in turn uses 
org.apache.activemq.transport.discovery.simple.SimpleDiscoveryAgent to 
determine whether the URLs configured in 
uri="static:(tcp://host1:61616,tcp://host2:61616,tcp://..) " are active.  
SimpleDiscoveryAgent#start() has this loop:

    public void start() throws Exception {
        running.set(true);
        for (int i = 0; i < services.length; i++) {
            listener.onServiceAdd(new SimpleDiscoveryEvent(services[i]));
        }
    }

"listener.onServiceAdd(...) " is called for each URL and is implemented by 
DiscoveryNetworkConnector#onServiceAdd(...).  The main thread calls 
BrokerService#start()  which calls DiscoveryNetworkConnector#start() which 
calls SimpleDiscoveryAgent#start(), which sequentially calls 
DiscoveryNetworkConnector#onServiceAdd(...).  Since the URLs being "discovered" 
are inactive, DiscoveryNetworkConnector#onServiceAdd(...)  blocks ~1m30s (this 
will depend on network configuration) for each URL.  This blocks the main 
thread that is trying to start the broker.  If there are several inactive URLs, 
then the blocking time becomes excessive.

Solution
=======
If you follow through the DiscoveryNetworkConnector#onServiceAdd(...) method, 
it eventially calls SimpleDiscoveryAgent#serviceFailed(...) for each inactive 
URL.  In turn SimpleDiscoveryAgent#serviceFailed(...) launches an asynchronous 
task that pauses for the configured reconnect delay, and then retries the call 
to DiscoveryNetworkConnector#onServiceAdd(...).  So it must be safe to call 
DiscoveryNetworkConnector#onServiceAdd(...) concurrently.  Therefore, 
SimpleDiscoveryAgent#start()'s loop should be changed to launch asynchronous 
tasks to make the DiscoveryNetworkConnector#onServiceAdd(...) calls 
concurrently rather than synchronously.

This solution has the benefit of returning control immediately to the caller of 
SimpleDiscoveryAgent#start(...), thus starting the broker faster, and allows 
the network discovery to find "active" URLs much faster (i.e., with the 
sequential loop, if the "active" URL is the last one, its discovery is 
significantly delayed).



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to