[ 
https://issues.apache.org/jira/browse/QPID-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246857#comment-17246857
 ] 

Alex Rudyy commented on QPID-8489:
----------------------------------

Hi [~daniel.kirilyuk],
Thanks for sharing netstat data.

Here is my current theory about the reported issue.

It seems that the client closed the TCP connection following the connection 
termination four-way handshake. The broker received the connection termination 
request and acknowledged it, but has not performed an active close (this 
corresponds to the CLOSE_WAIT state reported by netstat). I suspect that the 
client managed to send a TLS {{close_notify}}, but the broker has not been 
able to do so. The TLS handshake status in the broker's {{SSLEngine}} is set 
to {{SSLEngineResult.HandshakeStatus.NEED_WRAP}} (indicated by the log 
'doRead() - _closed: false, _delegate.readyForRead(): false'). It seems that 
the engine needs to send some remaining data (perhaps the TLS 
{{close_notify}}), but the write attempts do not succeed (as per the log 
'_bufferedSize: 0, _fullyWritten: true'). The {{SSLEngine}} handshake status 
remains {{SSLEngineResult.HandshakeStatus.NEED_WRAP}}. It is not yet clear to 
me why the {{SSLEngine}} cannot finish its write. The write implementation 
looks good to me, and I could not spot any problem with the read or write 
paths in {{NonBlockingConnectionTLSDelegate}}.
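
For illustration, here is a minimal sketch (not broker code; the class and 
method names are hypothetical) of the wrap loop one would expect while an 
engine is in {{NEED_WRAP}} during close: the engine should emit the 
{{close_notify}} bytes and eventually report its outbound side done. The 
symptom above behaves as if this loop never makes progress.

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import java.nio.ByteBuffer;

public class CloseNotifySketch {
    // Hypothetical sketch: drain the pending close_notify bytes out of an
    // SSLEngine. A healthy close loop keeps calling wrap() until the engine
    // reports its outbound side done; the returned buffer must then actually
    // be written to the socket for a clean TLS close.
    static ByteBuffer drainCloseNotify(SSLEngine engine) throws Exception {
        ByteBuffer net = ByteBuffer.allocate(engine.getSession().getPacketBufferSize());
        ByteBuffer empty = ByteBuffer.allocate(0);
        engine.closeOutbound(); // ask the engine to queue a close_notify alert
        while (!engine.isOutboundDone()) {
            SSLEngineResult r = engine.wrap(empty, net); // produces alert bytes
            if (r.bytesProduced() == 0) {
                break; // no progress -- avoid spinning like the reported loop
            }
        }
        net.flip();
        return net; // bytes that still need to reach the socket
    }

    public static void main(String[] args) throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);
        ByteBuffer out = drainCloseNotify(engine);
        System.out.println("pending close bytes: " + out.remaining());
    }
}
```

The key point of the sketch is the loop exit condition: if wrap() stops 
producing bytes while the engine still reports {{NEED_WRAP}}, a caller that 
reschedules unconditionally would spin exactly as described in this issue.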

In order to rule out a JDK-related defect, could you please share the JDK 
version in use and the TLS protocol version?

It would also be useful to capture the broker JVM TLS debug logging 
({{-Djavax.net.debug=ssl}}) for the connection close, covering the interval 
from when the client closes the connection until the client process is 
killed. I expect we would see there the receipt of {{close_notify}} from the 
client and no broker response with {{close_notify}}. The JVM debug logs for 
the connection close should give us some hints about what happened on the 
broker side. If you could share them, that would be very helpful.

If you could also capture {{netstat -t}} output to get the number of bytes 
waiting to be read (the Recv-Q column), that would be helpful as well.
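
For reference, a possible way to capture both at once (the {{QPID_OPTS}} 
variable name and port 5671 are assumptions; pass the JVM option however your 
broker startup script accepts it, and use your actual AMQPS port):

```shell
# Enable JVM TLS debug logging before starting the broker.
# NOTE: QPID_OPTS and the script path are assumptions -- use whatever
# mechanism your qpid-server startup script provides for JVM options.
export QPID_OPTS="-Djavax.net.debug=ssl"
./bin/qpid-server > broker-ssl-debug.log 2>&1 &

# While reproducing the close, sample the TCP state; the Recv-Q column
# shows bytes received by the kernel but not yet read by the broker.
netstat -tn | grep 5671
```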

Kind Regards,
Alex

> Connection thread looping
> -------------------------
>
>                 Key: QPID-8489
>                 URL: https://issues.apache.org/jira/browse/QPID-8489
>             Project: Qpid
>          Issue Type: Bug
>          Components: Broker-J
>    Affects Versions: qpid-java-broker-8.0.2
>            Reporter: Daniil Kirilyuk
>            Priority: Minor
>         Attachments: QPID-8489.log, simple_recv.cpp, thread-dump.st
>
>
> Error happens quite rarely and is not easy to reproduce. Although the problem 
> was partly addressed by the fix for QPID-8477, it can still be reproduced. 
> The main symptom is a significant increase in CPU usage even when no messages 
> are being sent to the broker anymore. CPU usage can rise from 30% to 90% and 
> higher, making the broker unusable. After such a CPU rise, the only way to 
> recover the broker is to restart it.
> Analysis has shown that the error occurs with the CPP Proton client when
> 1) an SSL connection is used
> 2) connection errors on the client side are ignored
> 3) the connection is dropped due to client process termination / network 
> disconnection
> *Steps to reproduce*
>  # Java broker should be installed
>  # Broker should be configured to allow one connection
>  # Prepare certificates
>  # Install Qpid::Proton 0.28.0
> wget 
> [http://archive.apache.org/dist/qpid/proton/0.28.0/qpid-proton-0.28.0.tar.gz]
> tar -xzf qpid-proton-0.28.0.tar.gz
> mkdir -p qpid-proton-0.28.0/build && pushd qpid-proton-0.28.0/build && cmake 
> .. && make all && popd
>  # Replace the example *qpid-proton-0.28.0/cpp/examples/simple_recv.cpp* 
> with the attached one
>  # Build again
> cd qpid-proton-0.28.0/build
> make
>  # Break the broker
> ./cpp/examples/simple_recv & ./cpp/examples/simple_recv
> Connection error
> {color:#ffab00}^C <= Hit Ctrl+C to kill process{color}
>  # {color:#172b4d}If CPU usage did not increase, find the PID of the first 
> simple_recv process using ps -ef | grep simple_recv and kill it using kill -9 
> PID.{color}
> *Analysis*
> CPU usage rises when the connection is dropped on the client side or when the 
> network between client and broker is broken. The key point is that the client 
> is not well behaved and the connection is not closed correctly.
> On the broker side the connection becomes "orphaned": it is still maintained 
> by the broker, but no real reading / writing is performed. The following 
> methods are called in an endless loop for each "orphaned" connection:
> SelectorThread.performSelect() 
> SelectorThread.ConnectionProcessor.processConnection()
> NetworkConnectionScheduler.processConnection()
> NonBlockingConnection.doWork()
> As nothing is physically read or written, both NonBlockingConnection.doRead() 
> and NonBlockingConnection.doWrite() return very quickly (within several 
> milliseconds) without blocking, after which the connection is immediately 
> rescheduled for processing in NetworkConnectionScheduler. Then the loop 
> repeats.
> As the connection lifecycle appears normal, nothing unusual or suspicious is 
> logged (nothing is seen in the log at all).
> The thread dump (see attachment) shows that the busiest threads are those 
> named virtualhost-default-iopool-XX. A typical stack trace looks like the 
> following:
> {noformat}
> "virtualhost-default-iopool-39" #92 prio=5 os_prio=0 tid=0x00007f47ec335800 nid=0x37196 waiting on condition [0x00007f476a4e3000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000f39105d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:532)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$18/422330142.run(Unknown Source)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The main symptom of the error is rising CPU usage, which can reach 90% and 
> above when several connections are "orphaned". An additional factor 
> contributing to the problem is a disabled keep-alive option on the 
> connection, or a long keep-alive interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
