[ 
https://issues.apache.org/jira/browse/QPID-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248476#comment-17248476
 ] 

Alex Rudyy commented on QPID-8489:
----------------------------------

Hi [~daniel.kirilyuk],
Thanks for trying out the patch. To be fair, I really doubted that the 
implemented changes would fix the issue in the right way. I was actually 
surprised to read that similar changes helped to resolve a similar problem 
reported in another project.

The netstat output indicates that there is one byte left to read on the socket:
tcp6 1 0 devhost.deu:10606 devhost.deu:38386 CLOSE_WAIT

It seems that this remaining byte causes the 'read' selector key to be set. 
However, the Broker cannot read, as the connection's 'SSLEngine' handshake 
status is set to 'NEED_WRAP'. The broker does not have anything to write at 
this point. An empty buffer is passed into SSLEngine#wrap, but that does not 
have any effect. As a result, the loop repeats.
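The looping state can be recognised purely from the SSLEngineResult that wrap returns. Below is a minimal sketch (the isStuck helper is hypothetical, not Broker-J code) of a guard that flags a wrap call which made no progress while the engine still reports NEED_WRAP; the main method simulates the result described above rather than driving a real handshake:

```java
import javax.net.ssl.SSLEngineResult;

public class WrapProgressGuard
{
    /** Hypothetical helper: true when a wrap() call consumed and produced
     *  nothing while the engine still reports NEED_WRAP, i.e. the
     *  read-selector loop would spin without making progress. */
    static boolean isStuck(SSLEngineResult result)
    {
        return result.bytesConsumed() == 0
                && result.bytesProduced() == 0
                && result.getHandshakeStatus() == SSLEngineResult.HandshakeStatus.NEED_WRAP;
    }

    public static void main(String[] args)
    {
        // Simulate the result SSLEngine#wrap returns in the looping state:
        // status OK, nothing consumed, nothing produced, handshake NEED_WRAP.
        SSLEngineResult spinning = new SSLEngineResult(
                SSLEngineResult.Status.OK,
                SSLEngineResult.HandshakeStatus.NEED_WRAP, 0, 0);
        System.out.println(isStuck(spinning)); // prints "true"
    }
}
```

A guard like this could break the reschedule cycle (e.g. by closing the connection) instead of re-queueing it indefinitely.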

The exhibited behaviour looks like a JDK defect to me. Today I looked through 
JDK defects and discovered a couple of defects matching our behaviour.

[https://bugs.openjdk.java.net/browse/JDK-8240071]. This one is closed as a 
duplicate of JDK-8214418. The latter is unfortunately a private defect. I bet 
that JIRA might reveal a lot of interesting details.
[https://bugs.openjdk.java.net/browse/JDK-8220703]. I am not sure that your 
Qpid Broker is hit by this JDK defect, but it could be. If the client closes 
its end while the broker engine is in the NOT_HANDSHAKING state, the engine 
can go into the NEED_WRAP state. Adding debug logs reporting the result of the 
wrap operation might prove or disprove that.
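Such a debug log only needs the four fields of the SSLEngineResult. A sketch of a logging wrapper (the loggedWrap method name is illustrative, not an existing Broker-J API); the main method drives the first wrap of an ordinary client handshake just to show the log line format:

```java
import java.nio.ByteBuffer;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import javax.net.ssl.SSLException;

public class WrapLogging
{
    // Sketch: wrap() plus a log line reporting exactly the fields that would
    // prove or disprove the stuck NEED_WRAP theory.
    static SSLEngineResult loggedWrap(SSLEngine engine, ByteBuffer src, ByteBuffer dst)
            throws SSLException
    {
        SSLEngineResult r = engine.wrap(src, dst);
        System.out.printf("wrap: status=%s hs=%s consumed=%d produced=%d%n",
                r.getStatus(), r.getHandshakeStatus(),
                r.bytesConsumed(), r.bytesProduced());
        return r;
    }

    public static void main(String[] args) throws Exception
    {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);
        engine.beginHandshake();
        // The first wrap of a client handshake produces the ClientHello,
        // so here the log shows consumed=0 with a non-zero produced count;
        // in the suspected defect both counts would stay 0 under NEED_WRAP.
        ByteBuffer out = ByteBuffer.allocate(engine.getSession().getPacketBufferSize());
        loggedWrap(engine, ByteBuffer.allocate(0), out);
    }
}
```

A repeating log line with consumed=0 produced=0 hs=NEED_WRAP would confirm the theory.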

It seems that a workaround for the issue is disabling TLSv1.3.
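For completeness, one way to apply that workaround in code is to restrict the enabled protocols on the engine; a minimal sketch (how and where Broker-J should do this is not covered here, and the JVM-wide alternative would be the jdk.tls.disabledAlgorithms security property):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class DisableTls13
{
    public static void main(String[] args) throws Exception
    {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        // Restrict the engine to TLSv1.2 so the TLSv1.3 half-close code
        // paths implicated in the JDK defects above are never exercised.
        engine.setEnabledProtocols(new String[]{"TLSv1.2"});
        for (String p : engine.getEnabledProtocols())
        {
            System.out.println(p); // prints "TLSv1.2"
        }
    }
}
```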






> Connection thread looping
> -------------------------
>
>                 Key: QPID-8489
>                 URL: https://issues.apache.org/jira/browse/QPID-8489
>             Project: Qpid
>          Issue Type: Bug
>          Components: Broker-J
>    Affects Versions: qpid-java-broker-8.0.2
>            Reporter: Daniil Kirilyuk
>            Priority: Minor
>         Attachments: 
> 0001-QPID-8489-Break-an-infinite-connection-processing-lo.patch, QPID-8489 - 
> java.ssl.debug.log, QPID-8489.log, simple_recv.cpp, thread-dump.st
>
>
> The error happens quite rarely and is not easy to reproduce. Although the 
> problem was partly fixed by QPID-8477, it can still be reproduced. The main 
> symptom is a significant increase in CPU usage even when no messages are sent 
> to the broker anymore. CPU usage can rise from 30% to 90% and higher, making 
> the broker unusable. After such a CPU rise the only way to recover the broker 
> is to restart it.
> Analysis has shown that the error occurs with the CPP Proton client when
> 1) an SSL connection is used
> 2) connection errors on the client side are ignored
> 3) the connection is dropped due to client process termination / network 
> disconnection
> *Steps to reproduce*
>  # Java broker should be installed
>  # Broker should be configured to allow one connection
>  # Prepare certificates
>  # Install Qpid::Proton 0.28.0
> wget 
> [http://archive.apache.org/dist/qpid/proton/0.28.0/qpid-proton-0.28.0.tar.gz]
> tar xzf qpid-proton-0.28.0.tar.gz
> mkdir -p qpid-proton-0.28.0/build && pushd qpid-proton-0.28.0/build && cmake 
> .. && make all && popd
>  # Replace the example *qpid-proton-0.28.0/cpp/examples/simple_recv.cpp* 
> with the attached one
>  # Build again
> cd qpid-proton-0.28.0/build
> make
>  # Break the broker
> ./cpp/examples/simple_recv & ./cpp/examples/simple_recv
> Connection error
> {color:#ffab00}^C <= Hit Ctrl+C to kill process{color}
>  # {color:#172b4d}If CPU usage didn't increase, find the PID of the first 
> simple_recv process using ps -ef | grep simple_recv and kill it using kill -9 
> PID.{color}
> *Analysis*
> CPU usage rises when the connection is dropped on the client side or when the 
> network between client and broker is broken. The main point is that the 
> client isn't well behaved and the connection isn't closed correctly.
> On the broker side the connection becomes "orphaned": it is still maintained 
> by the broker, but no real reading / writing is performed. The following 
> methods are called in an endless loop for each "orphaned" connection:
> SelectorThread.performSelect() 
> SelectorThread.ConnectionProcessor.processConnection()
> NetworkConnectionScheduler.processConnection()
> NonBlockingConnection.doWork()
> As nothing is physically read or written, both methods 
> NonBlockingConnection.doRead() and NonBlockingConnection.doWrite() execute 
> very fast (several milliseconds) without any blocking, and after that the 
> connection is immediately rescheduled for processing in 
> NetworkConnectionScheduler. After that the loop repeats.
> As the connection lifecycle looks normal, nothing unusual or suspicious is 
> logged (nothing related is seen in the log at all).
> The thread dump (see attachment) shows that the busy threads are mostly 
> those named virtualhost-default-iopool-XX. A typical stack trace looks like 
> the following:
> {noformat}
> "virtualhost-default-iopool-39" #92 prio=5 os_prio=0 tid=0x00007f47ec335800 nid=0x37196 waiting on condition [0x00007f476a4e3000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000f39105d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:532)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$18/422330142.run(Unknown Source)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The main symptom of the error is rising CPU usage, which can reach up to 90% 
> when several connections are "orphaned". An additional factor leading to the 
> problem is a disabled keep-alive option for the connection, or a long 
> keep-alive interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
