[jira] [Created] (QPID-7655) It takes about 15 minutes for a push route (bridge) to restore communications with a remote broker upon network recovery

Jun Chen (JIRA) Wed, 01 Feb 2017 19:48:06 -0800

Jun Chen created QPID-7655:
------------------------------

             Summary: It takes about 15 minutes for a push route (bridge) to 
restore communications with a remote broker upon network recovery
                 Key: QPID-7655
                 URL: https://issues.apache.org/jira/browse/QPID-7655
             Project: Qpid
          Issue Type: Bug
          Components: C++ Broker
    Affects Versions: qpid-cpp-1.36.0, qpid-cpp-1.35.0
         Environment: Linux C++ broker with a bridge (push route) to a remote 
C++ broker
and the network connection to the remote broker can be turned off and on
            Reporter: Jun Chen



We have the following use case and issue: 
1. a Linux C++ broker is used on a mobile device. 
2. there is a push route (bridge) between a mobile Linux C++ broker and a 
remote broker established using the following:  
qpid-route queue add -s -d -v --ack 1 --credit 100 remoteBroker mobileBroker 
remoteExchange mobileQueue
3. the network connectivity for the mobile device is intermittent which means 
that the mobile device could be out of communication and then be in 
communication for some time. We would like that the qpid broker on the  mobile 
device can recover from communication losses as soon as possible when it 
transitions from out of communication to in communication. However in the 
current normal running environment, when this happens, it takes about 15 
minutes for the mobile broker to recover communications with the remote broker.

To re-create the issue, one must:
1. run the mobile broker with qpid logging set to info+. 
2. using qpid-send to send some messages to the mobile broker periodically say 
every second (to simulate messages put to the mobileQueue to be routed to the 
remote broker).

Then here is what it looked like

The network connection was turned off 2-3 minutes before 2017-01-24 21:20:03

then:

2017-01-24 21:20:03 [Protocol] error Federation link connection 
qpid.tcp:remoteBroker:5672 missed 2 heartbeats - closing connection
..........

Network connection was actually restored around 2017-01-24 21:21:00. However 
the inter-broker link could only be restored 15+ minutes later:
..........
2017-01-24 21:36:25 [System] error Error reading socket: Invalid argument(22)
2017-01-24 21:36:25 [Broker] info Inter-broker link disconnected from 
remoteBroker:5672 Closed by peer    (Note: this disconnect should happen at 
2017-01-24 21:20:03 when the connection was closed after 2 heartbeats were 
missed)
2017-01-24 21:36:25 [System] info Connecting: remoteBroker:5672
2017-01-24 21:36:25 [Broker] info Inter-broker link connecting to 
remoteBroker:5672
2017-01-24 21:36:25 [Broker] info Inter-broker link established to 
remoteBroker:5672
 
So the heartbeats triggered the connection close at 21:20:03 as expected. 
However, it almost seems as if the connection didn't close properly. There was 
nothing in the log after the "closing connection" until the socket error 
occured 15+ minutes later. No attempts to reconnect, nothing until the socket 
error.  Network connection was back for 15 minutes by the time the connection 
closed at 21:36:25. It then connected promptly and the messages started being 
routed out to the remote broker within 5-10 seconds of the link succeeding,

The following is my analysis of the root cause from C++  source code. There is 
also a workaround. But the bug needs to be fixed:
--------

Here is a bit more detailed issue why qpid broker takes 15+ minutes to close a 
connection which caused message delay if a device is out of the comms and then 
in comms again:

-------------
The issue was that in /qpid/sys/posix/AsynchIO.cpp, the following function can 
only be called if there is a writeable socket event 

void AsynchIO::writeable(DispatchHandle& h)

In the middle of the function there was a place that DispatchHandler will be 
closed by 

close(h);


In the same file, there was another function 

void AsynchIO::queueWriteClose()

In this function it will set a flag of queuedClose to true when there was a 
communication loss. The flag will trigger close(h) in the aforementioned 
function of writeable(DispatchHandle& h) when the  writeable() function is 
called next time. However if this function is not triggered, close(h) will not 
be called until 15 mins later the socket error happens.

The fix shall be provided so that when queuedClose is set, writeable() must be 
called immediately to close(h).

Without source code change or fix, the workaround is to increase the socket 
send buffer size by the following linux commands so that writeable() can still 
be triggered after 2-3 minutes queuedClose is set:
sysctl -w net.core.wmem_max=4194304
sysctl -w net.ipv4.tcp_wmem='4096 4194304 4194304'
sysctl -w net.ipv4.route.flush=1 





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (QPID-7655) It takes about 15 minutes for a push route (bridge) to restore communications with a remote broker upon network recovery

Reply via email to