[
https://issues.apache.org/jira/browse/QPID-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244873#comment-17244873
]
Alex Rudyy edited comment on QPID-8489 at 12/6/20, 11:34 PM:
-------------------------------------------------------------
Hi [~daniel.kirilyuk],
I tried to reproduce the reported problem on my machine. I followed the
instructions as specified in the JIRA description, but I was not able to
reproduce the problem.
I even tried using long keep-alive intervals, as hinted in the description, but
it did not help:
{noformat}
> sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
> sudo sysctl -w net.ipv4.tcp_keepalive_time=999999 net.ipv4.tcp_keepalive_intvl=999999 net.ipv4.tcp_keepalive_probes=999999
net.ipv4.tcp_keepalive_time = 999999
net.ipv4.tcp_keepalive_intvl = 999999
net.ipv4.tcp_keepalive_probes = 999999
> sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time = 999999
net.ipv4.tcp_keepalive_intvl = 999999
net.ipv4.tcp_keepalive_probes = 999999
{noformat}
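These OS-level settings only take effect on sockets that enable SO_KEEPALIVE in
the first place. As a generic illustration with plain Java NIO (not Broker-J
configuration), that would be:
{noformat}
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;

// With SO_KEEPALIVE enabled, the OS probes an idle peer using the
// net.ipv4.tcp_keepalive_* timings and eventually reports the broken
// connection back to the selector instead of leaving it "orphaned".
public class KeepAliveSketch
{
    static void enableKeepAlive(SocketChannel channel) throws IOException
    {
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
    }
}
{noformat}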
A tight loop in the selector thread points to a defect in the selector thread's
handling of the connection.
It is not entirely clear to me what is happening, but it seems that the
impacted connection somehow sets the flag AMQPConnection_0_10Impl#_stateChanged
to 'true' in #doWork(), which results in immediate execution or rescheduling of
the connection.
Would you be able to add extra logging to SelectorThread and
NonBlockingConnection and repeat your reproduction, in order to figure out what
exactly is causing the connection to be re-scheduled?
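To illustrate the kind of loop and logging I have in mind, here is a rough
sketch. The names echo the classes and the flag mentioned above, but the
structure is a simplified assumption for illustration, not the actual Broker-J
code:
{noformat}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified model of the suspected behaviour: if _stateChanged keeps ending up
// 'true' inside doWork(), the connection is re-queued immediately and the
// worker threads spin without any socket event.
public class RescheduleLoopSketch
{
    static class Connection
    {
        final AtomicBoolean _stateChanged = new AtomicBoolean(true);

        boolean doWork()
        {
            // doRead()/doWrite() are effectively no-ops for an "orphaned"
            // connection, yet something sets the flag again before we return.
            boolean rerun = _stateChanged.getAndSet(false);
            return rerun || somethingSetsTheFlagAgain();
        }

        private boolean somethingSetsTheFlagAgain()
        {
            return true; // the open question: what keeps making this true?
        }
    }

    public static void main(String[] args)
    {
        Queue<Connection> scheduled = new ConcurrentLinkedQueue<>();
        scheduled.add(new Connection());
        for (int i = 0; i < 5 && !scheduled.isEmpty(); i++) // bounded here, unbounded in the broker
        {
            Connection connection = scheduled.poll();
            boolean rerun = connection.doWork();
            // This is roughly the place where extra debug logging would tell us
            // why the connection is re-run, e.g. the value of _stateChanged and
            // whether any bytes were actually read or written.
            if (rerun)
            {
                scheduled.add(connection); // immediate reschedule => tight loop
            }
        }
    }
}
{noformat}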
The changes in pull request 73 cannot be a valid fix for the issue. The
implemented changes send a heartbeat in order to verify that the connection is
alive. If heartbeating is negotiated during connection opening, the broker
should be sending heartbeats anyway. If heartbeating is not negotiated, then
sending a heartbeat and expecting a response would be the wrong thing to do.
We need to fix the root cause behind re-running the connection in the selector
thread. It should not really happen, as there is no read/write event coming from
the socket and there should not be any work created for the connection on the
broker side. The connection should be sitting among the unscheduled connections
in the SelectorThread until the OS notifies the broker that the socket is
closed.
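For reference, a remote close normally surfaces to a selector in plain Java NIO
roughly as in the standalone sketch below (generic NIO, not Broker-J code): an
idle registered key produces no work at all, and a graceful peer close shows up
as a read event whose read() returns -1 (an abortive reset would raise an
IOException instead).
{noformat}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Minimal NIO server: an idle connection produces no selected keys at all;
// once the peer closes gracefully, OP_READ fires and read() returns -1.
public class CloseDetectionSketch
{
    public static void main(String[] args) throws IOException
    {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        System.out.println("Listening on " + server.getLocalAddress());

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true)
        {
            selector.select();                     // blocks: no events, no work
            for (SelectionKey key : selector.selectedKeys())
            {
                if (key.isAcceptable())
                {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                }
                else if (key.isReadable())
                {
                    buffer.clear();
                    int read = ((SocketChannel) key.channel()).read(buffer);
                    if (read == -1)                // OS reported the remote close
                    {
                        key.cancel();
                        key.channel().close();
                    }
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
{noformat}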
> Connection thread looping
> -------------------------
>
> Key: QPID-8489
> URL: https://issues.apache.org/jira/browse/QPID-8489
> Project: Qpid
> Issue Type: Bug
> Components: Broker-J
> Affects Versions: qpid-java-broker-8.0.2
> Reporter: Daniil Kirilyuk
> Priority: Minor
> Attachments: simple_recv.cpp, thread-dump.st
>
>
> The error happens quite rarely and is not easy to reproduce. Although the
> problem was partly fixed by QPID-8477, it can still be reproduced. The main
> symptom is a significant increase in CPU usage even when no messages are sent
> to the broker anymore. CPU usage can rise from 30% to 90% and higher, making
> the broker unusable. After such a CPU rise the only way to recover the broker
> is to restart it.
> Analysis has shown that the error occurs with the C++ Proton client in cases
> when
> 1) an SSL connection is used
> 2) connection errors on the client side are ignored
> 3) the connection is dropped due to client process termination / network
> disconnection
> *Steps to reproduce*
> # Java broker should be installed
> # Broker should be configured to allow one connection
> # Prepare certificates
> # Install Qpid::Proton 0.28.0
> wget
> [http://archive.apache.org/dist/qpid/proton/0.28.0/qpid-proton-0.28.0.tar.gz]
> tar -xzf qpid-proton-0.28.0.tar.gz
> mkdir -p qpid-proton-0.28.0/build && pushd qpid-proton-0.28.0/build && cmake
> .. && make all && popd
> # Replace the example *qpid-proton-0.28.0/cpp/examples/simple_recv.cpp* with
> the attached one
> # Build again
> cd qpid-proton-0.28.0/build
> make
> # Break the broker
> ./cpp/examples/simple_recv & ./cpp/examples/simple_recv
> Connection error
> {color:#ffab00}^C <= Hit Ctrl+C to kill process{color}
> # {color:#172b4d}If CPU usage didn't increase, find the PID of the first
> simple_recv process using ps -ef | grep simple_recv and kill it using kill -9
> PID.{color}
> *Analysis*
> CPU usage rises when connection is dropped on the client side or when network
> is broken between client and broker. The main point is that client isn't well
> behaved and connection shouldn't be closed correctly.
> On broker side connection becomes "orphaned": it is still maintained by
> broker, but no real reading / writing is performed. Following method calls
> are performed in an endless loop for each "orphaned" connection:
> SelectorThread.performSelect()
> SelectorThread.ConnectionProcessor.processConnection()
> NetworkConnectionScheduler.processConnection()
> NonBlockingConnection.doWork()
> As nothing is physically read or written, both methods
> NonBlockingConnection.doRead() and NonBlockingConnection.doWrite() execute
> very quickly (within a few milliseconds) without blocking, and after that the
> connection is immediately rescheduled for processing in the
> NetworkConnectionScheduler. Then the loop repeats.
> As the connection lifecycle is normal, nothing unusual or suspicious is logged
> (nothing is seen in the log at all).
> The thread dump (see attachment) shows that the busy threads are mostly those
> named virtualhost-default-iopool-XX. A typical stack trace looks like the
> following:
> {noformat}
> "virtualhost-default-iopool-39" #92 prio=5 os_prio=0 tid=0x00007f47ec335800
> nid=0x37196 waiting on condition [0x00007f476a4e3000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000f39105d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:532)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
>         at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$18/422330142.run(Unknown Source)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The main symptom of the error is rising CPU usage, which can reach up to 90%
> when several connections are "orphaned". An additional factor contributing to
> the problem is a disabled keep-alive option on the connection or a long
> keep-alive interval.