[
https://issues.apache.org/jira/browse/QPID-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniil Kirilyuk updated QPID-8489:
----------------------------------
Attachment: QPID-8489 - java.ssl.debug.log
> Connection thread looping
> -------------------------
>
> Key: QPID-8489
> URL: https://issues.apache.org/jira/browse/QPID-8489
> Project: Qpid
> Issue Type: Bug
> Components: Broker-J
> Affects Versions: qpid-java-broker-8.0.2
> Reporter: Daniil Kirilyuk
> Priority: Minor
> Attachments: QPID-8489 - java.ssl.debug.log, QPID-8489.log,
> simple_recv.cpp, thread-dump.st
>
>
> Error happens quite rarely and is not easy to reproduce. Although the problem
> was partly fixed by fixing QPID-8477, it still can be reproduced. The main
> symptom is significant increase of CPU usage even when no messages are sent
> to broker anymore. CPU usage can rise from 30% to 90% and higher, making
> broker unusable. After such CPU rise the only way to fix broker will be
> restarting it.
> Analysis has shown, that error occurs with CPP proton client in cases when
> 1) SSL connection is used
> 2) connection errors on client side are ignored
> 3) connection is dropped due to the client process termination / network
> disconnection
> *Steps to reproduce*
> # Java broker should be installed
> # Broker should be configured to allow one connection
> # Prepare certificates
> # Install Qpid::Proton 0.28.0
> wget
> [http://archive.apache.org/dist/qpid/proton/0.28.0/qpid-proton-0.28.0.tar.gz]
> gunzip qpid-proton-0.28.0.tar.gz
> mkdir -p qpid-proton-0.28.0/build && pushd qpid-proton-0.28.0/build && cmake
> .. && make all && popd
> # Replace and edit example *qpid-proton-0.28.0/cpp/examples/simple_recv.cpp*
> with the one attached
> # Build again
> cd qpid-proton-0.28.0/build
> make
> # Break the broker
> ./cpp/examples/simple_recv & ./cpp/examples/simple_recv
> Connection error
> {color:#ffab00}^C <= Hit Ctrl+C to kill process{color}
> # {color:#172b4d}If CPU usage didn't increased, find the PID of the first
> simple_recv process using ps-ef | grep simple_recv and kill it using kill -9
> PID.{color}
> *Analysis*
> CPU usage rises when connection is dropped on the client side or when network
> is broken between client and broker. The main point is that client isn't well
> behaved and connection shouldn't be closed correctly.
> On broker side connection becomes "orphaned": it is still maintained by
> broker, but no real reading / writing is performed. Following method calls
> are performed in an endless loop for each "orphaned" connection:
> SelectorThread.performSelect()
> SelectorThread.ConnectionProcessor.processConnection()
> NetworkConnectionScheduler.processConnection()
> NonBlockingConnection.doWork()
> As there nothing physically read or written, both methods
> NonBlockingConnection.doRead() and NonBlockingConnection.doWrite() execute
> very fast (several milliseconds) without any blocking processes and after
> that connection is immediately rescheduled for processing in
> NetworkConnectionScheduler. After that loop repeats.
> As the connection lifecycle is normal, there is logged nothing unusual or
> suspicious (nothing is seen in log at all).
> In thread dump (see attachment) there is seen, that utilized are mostly
> thread with names virtualhost-default-iopool-XX. Typical stacktrace looks
> like following:
> {noformat}
> "virtualhost-default-iopool-39" #92 prio=5 os_prio=0 tid=0x00007f47ec335800
> nid=0x37196 waiting on condition
> [0x00007f476a4e3000]"virtualhost-default-iopool-39" #92 prio=5 os_prio=0
> tid=0x00007f47ec335800 nid=0x37196 waiting on condition [0x00007f476a4e3000]
> java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native
> Method) - parking to wait for <0x00000000f39105d0> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at
> org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:532)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at
> org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
> at
> org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$18/422330142.run(Unknown
> Source) at java.lang.Thread.run(Thread.java:748)}
> {noformat}
> The main symptom of an error is rising CPU usage, which can reach up to 90%
> in case, when several connections are "orphaned". Additional factor leading
> to the problem is disabled keep-alive option for a connection or long
> keep-alive interval.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]