Daniil Kirilyuk created QPID-8489:
-------------------------------------

             Summary: Connection thread looping
                 Key: QPID-8489
                 URL: https://issues.apache.org/jira/browse/QPID-8489
             Project: Qpid
          Issue Type: Bug
          Components: Broker-J
    Affects Versions: qpid-java-broker-8.0.2
            Reporter: Daniil Kirilyuk


Error happens quite rarely and is not easy to reproduce. Although the problem 
was partly fixed by fixing QPID-8477, it still can be reproduced. The main 
symptom is significant increase of CPU usage even when no messages are sent to 
broker anymore. CPU usage can rise from 30% to 90% and higher, making broker 
unusable. After such CPU rise the only way to fix broker will be restarting it.

Analysis has shown, that error occurs with CPP proton client in cases when

1) SSL connection is used

2) connection errors on client side are ignored

3) connection is dropped due to the client process termination / network 
disconnection

*Steps to reproduce*
 # Java broker should be installed
 # Broker should be configured to allow one connection
 # Prepare certificates
 # Install Qpid::Proton 0.28.0

wget 
[http://archive.apache.org/dist/qpid/proton/0.28.0/qpid-proton-0.28.0.tar.gz]

gunzip qpid-proton-0.28.0.tar.gz

mkdir -p qpid-proton-0.28.0/build && pushd qpid-proton-0.28.0/build && cmake .. 
&& make all && popd

 # Replace and edit example *qpid-proton-0.28.0/cpp/examples/simple_recv.cpp* 
with the one attached

 # Build again

cd qpid-proton-0.28.0/build

make

 # Break the broker

./cpp/examples/simple_recv & ./cpp/examples/simple_recv

Connection error

{color:#ffab00}^C <= Hit Ctrl+C to kill process{color}

 # {color:#172b4d}If CPU usage didn't increased, find the PID of the first 
simple_recv process using ps-ef | grep simple_recv and kill it using kill -9 
PID.{color}

*Analysis*

CPU usage rises when connection is dropped on the client side or when network 
is broken between client and broker. The main point is that client isn't well 
behaved and connection shouldn't be closed correctly.

On broker side connection becomes "orphaned": it is still maintained by broker, 
but no real reading / writing is performed. Following method calls are 
performed in an endless loop for each "orphaned" connection:

SelectorThread.performSelect() 

SelectorThread.ConnectionProcessor.processConnection()

NetworkConnectionScheduler.processConnection()

NonBlockingConnection.doWork()

As there nothing physically read or written, both methods 
NonBlockingConnection.doRead() and NonBlockingConnection.doWrite() execute very 
fast (several milliseconds) without any blocking processes and after that 
connection is immediately rescheduled for processing in 
NetworkConnectionScheduler. After that loop repeats.

As the connection lifecycle is normal, there is logged nothing unusual or 
suspicious (nothing is seen in log at all).

In thread dump (see attachment) there is seen, that utilized are mostly thread 
with names virtualhost-default-iopool-XX. Typical stacktrace looks like 
following:
{noformat}
"virtualhost-default-iopool-39" #92 prio=5 os_prio=0 tid=0x00007f47ec335800 
nid=0x37196 waiting on condition 
[0x00007f476a4e3000]"virtualhost-default-iopool-39" #92 prio=5 os_prio=0 
tid=0x00007f47ec335800 nid=0x37196 waiting on condition [0x00007f476a4e3000]   
java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native 
Method) - parking to wait for  <0x00000000f39105d0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:532) 
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at 
org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
 at 
org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$18/422330142.run(Unknown
 Source) at java.lang.Thread.run(Thread.java:748)}
{noformat}
The main symptom of an error is rising CPU usage, which can reach up to 90% in 
case, when several connections are "orphaned". Additional factor leading to the 
problem is disabled keep-alive option for a connection or long keep-alive 
interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to