[
https://issues.apache.org/jira/browse/QPID-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318944#comment-17318944
]
Alex Rudyy edited comment on QPID-8514 at 4/11/21, 10:29 PM:
-------------------------------------------------------------
Hi [~daniel.kirilyuk]
I spent some time looking into the reported issue. I believe it is caused by a
defect in {{RingOverflowPolicyHandler}}.
The ring overflow policy implementation tries to delete the oldest message on
the queue, which is found via {{queue.getLeastSignificantOldestEntry()}}. If
that message is acquired by a consumer (and its state is not an instance of
{{StealableConsumerAcquiredState}}), the existing delete functionality cannot
steal the message (change its state). Instead, it registers a special listener
in order to acquire and delete the message later, after the consumer releases
it. If either of the queue depths does not drop below its maximum allowed limit
(which is the case when the maximum message depth is 1), the
{{RingOverflowPolicyHandler}} keeps iterating through the same code, attempting
to delete the same consumer-acquired message. This registers a duplicate
{{StateChangeListenerEntry}} on every loop iteration. That eventually causes
high CPU consumption, as the linked list keeping the state change listeners
grows on every iteration.
If the consumer never settles the message (for example, because it is hung),
this can become an infinite loop in {{RingOverflowPolicyHandler}} on enqueue of
a new message. Eventually, the Broker will run out of memory due to the growth
of the {{StateChangeListener}} list.
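To illustrate the pattern, here is a minimal, self-contained model of the loop
described above. It is not the actual Broker-J code; the class and field names
are made up for illustration only.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified model of the behaviour described above, NOT the real Broker-J code.
// Entry stands in for a queue entry; the loop mimics the overflow handler
// retrying deletion of the same consumer-acquired oldest entry.
public class RingOverflowLoopSketch
{
    static class Entry
    {
        boolean acquiredByConsumer = true;                        // consumer holds the entry
        final List<Runnable> stateChangeListeners = new ArrayList<>();

        boolean acquireOrSteal(Runnable listener)
        {
            if (acquiredByConsumer)
            {
                // cannot steal the entry: register a listener to delete it later
                stateChangeListeners.add(listener);
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args)
    {
        final Entry oldest = new Entry();
        final int maxDepth = 1;
        int queueDepth = 2;                                       // already above the limit
        int iterations = 0;

        // The depth never drops below the limit, so the handler keeps picking the
        // same acquired entry and registers yet another listener on every pass
        // (bounded here only so the example terminates).
        while (queueDepth > maxDepth && iterations < 1_000_000)
        {
            oldest.acquireOrSteal(() -> { /* delete on release */ });
            iterations++;
        }

        System.out.println("duplicate listeners registered: "
                + oldest.stateChangeListeners.size());            // prints 1000000
    }
}
{code}
Every pass through the loop adds one more duplicate listener to the same entry,
which is where the CPU time and the memory go.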
I think we can try to resolve this issue with the following (a sketch of the
intended control flow follows the list):
* change the implementation of {{queue.getLeastSignificantOldestEntry()}} to
return the oldest entry that is not acquired by a consumer
* if no such entry is available, the {{RingOverflowPolicyHandler}} should stop
the deletion attempts and break the loop
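A rough, self-contained sketch of that control flow (the class and method names
below are illustrative stand-ins, not the actual Broker-J API):
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the proposed control flow only; the names below are illustrative
// stand-ins, not the actual Broker-J classes or methods.
public class RingOverflowFixSketch
{
    static class Entry
    {
        final boolean acquiredByConsumer;

        Entry(boolean acquiredByConsumer)
        {
            this.acquiredByConsumer = acquiredByConsumer;
        }
    }

    // Stands in for queue.getLeastSignificantOldestEntry() after the proposed
    // change: return the oldest entry that is NOT consumer acquired, or null.
    static Entry leastSignificantOldestEntry(Deque<Entry> queue)
    {
        return queue.stream()
                    .filter(entry -> !entry.acquiredByConsumer)
                    .findFirst()
                    .orElse(null);
    }

    public static void main(String[] args)
    {
        final Deque<Entry> queue = new ArrayDeque<>();
        queue.add(new Entry(true));      // acquired by a consumer, cannot be stolen
        queue.add(new Entry(true));      // acquired by a consumer, cannot be stolen
        final int maxDepth = 1;

        while (queue.size() > maxDepth)
        {
            final Entry oldest = leastSignificantOldestEntry(queue);
            if (oldest == null)
            {
                // Nothing deletable right now: stop retrying instead of spinning
                // and registering duplicate state change listeners.
                break;
            }
            queue.remove(oldest);
        }

        System.out.println("remaining depth: " + queue.size());   // 2: the loop exits instead of spinning
    }
}
{code}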
Kind Regards,
Alex
> High CPU usage and stuck connections
> ------------------------------------
>
> Key: QPID-8514
> URL: https://issues.apache.org/jira/browse/QPID-8514
> Project: Qpid
> Issue Type: Bug
> Components: Broker-J
> Affects Versions: qpid-java-broker-8.0.4
> Reporter: Daniil Kirilyuk
> Priority: Minor
> Attachments: RingPolicyConcurrencyTest.java, threaddump.zip
>
>
> There were around 600 connections in state CLOSE_WAIT:
> {noformat}
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep TIME_WAIT | wc -l
> 38
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep CLOSE_WAIT | wc -l
> 625
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep -v CLOSE_WAIT | grep -v TIME_WAIT | wc -l
> 7
> {noformat}
> CPU usage rises to 90% and higher.
> Connections could not be closed via the web GUI. The broker log contains the
> message below, but the connections were not closed.
> {noformat}
> 2021-03-26 15:12:28,474 INFO [Broker-Config] (q.m.c.model_delete) - [mng:3hpU+miq(admin@/127.0.0.1:62743)] [con:751(TEST@/127.0.0.1:40422/default)] CON-1007 : Connection close initiated by operator
> {noformat}
> Thread dumps show around 49 threads with the same stack trace related to the
> ring queue:
> {noformat}
> java.lang.Thread.State: RUNNABLE
> at org.apache.qpid.server.queue.QueueEntryImpl.addStateChangeListener(QueueEntryImpl.java:674)
> at org.apache.qpid.server.queue.QueueEntryImpl.acquireOrSteal(QueueEntryImpl.java:316)
> at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1913)
> at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1863)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.checkOverflow(RingOverflowPolicyHandler.java:102)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.access$000(RingOverflowPolicyHandler.java:44)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler.checkOverflow(RingOverflowPolicyHandler.java:41)
> at org.apache.qpid.server.queue.AbstractQueue.doEnqueue(AbstractQueue.java:1382)
> at org.apache.qpid.server.queue.AbstractQueue.enqueue(AbstractQueue.java:1290)
> at org.apache.qpid.server.message.RoutingResult$1.postCommit(RoutingResult.java:135)
> at org.apache.qpid.server.txn.AsyncAutoCommitTransaction$3.postCommit(AsyncAutoCommitTransaction.java:309)
> at org.apache.qpid.server.txn.AsyncCommand.complete(AsyncCommand.java:84)
> at org.apache.qpid.server.protocol.v1_0.StandardReceivingLinkEndpoint.receiveComplete(StandardReceivingLinkEndpoint.java:588)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0.lambda$receivedComplete$5(Session_1_0.java:1343)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0$$Lambda$142/959267941.accept(Unknown Source)
> at java.lang.Iterable.forEach(Iterable.java:75)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0.receivedComplete(Session_1_0.java:1343)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.lambda$receivedComplete$11(AMQPConnection_1_0Impl.java:1349)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl$$Lambda$141/1774823837.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.receivedComplete(AMQPConnection_1_0Impl.java:1347)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.onReceive(AMQPConnection_1_0Impl.java:1328)
> at org.apache.qpid.server.transport.AbstractAMQPConnection.lambda$received$2(AbstractAMQPConnection.java:576)
> at org.apache.qpid.server.transport.AbstractAMQPConnection$$Lambda$119/6380061.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.qpid.server.transport.AbstractAMQPConnection.received(AbstractAMQPConnection.java:571)
> at org.apache.qpid.server.transport.MultiVersionProtocolEngine.received(MultiVersionProtocolEngine.java:135)
> at org.apache.qpid.server.transport.NonBlockingConnection.processAmqpData(NonBlockingConnection.java:611)
> at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:58)
> at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:496)
> at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:270)
> at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:134)
> at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:575)
> at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:366)
> at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:97)
> at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:533)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
> at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$10/1487500813.run(Unknown Source)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Also, the ring queue behaves strangely: the queue {{DeadLetterQueue}} has its
> maximum message depth set to {{1}}, but there were 55 messages on it in the
> {{Acquired}} state.
> First message in {{DeadLetterQueue}}: {{2021-03-25, 16:50:59.381 (00:00 UTC)}}.
> Last message in {{DeadLetterQueue}}: {{2021-03-25, 16:50:59.683 (00:00 UTC)}}.
> These times correlate with the drain request that timed out at 16:52:00,410,
> because the default drain request timeout is 60 seconds:
> {noformat}
> 2021-03-25 16:52:00,408 INFO [org.apache.qpid.jms.JmsSession] (QpidJMS Connection Executor: ID:986cd915-d072-4711-bd71-50a74aab4be8:1) A JMS MessageConsumer has been closed: JmsConsumerInfo: { ID:986cd915-d072-4711-bd71-50a74aab4be8:1:1:1, destination = broadcast.XX.PRISMA_LIVERiskLimitTrigger }
> 2021-03-25 16:52:00,410 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #3 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger' - trying to recover. Cause: Remote did not respond to a drain request in time
> 2021-03-25 16:52:01,414 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #6 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger'
> {noformat}
> After the problem was triggered, the broker started to reject client connections:
> {noformat}
> 2021-03-25 17:55:43,716 INFO [Broker-Config] (q.m.c.open) - [con:901(/127.0.0.1:56012)] CON-1001 : Open : Destination : amqp(127.0.0.1:22101) : Protocol Version : 1.0
> 2021-03-25 17:55:43,745 INFO [IO-/127.0.0.1:56012] (q.m.c.open) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1001 : Open : Destination : amqp(172.16.75.197:22101) : Protocol Version : 1.0 : Client ID : ID:af50d087-0610-406a-9792-ffa5bf6b68b4:5 : Client Version : 0.51.0 : Client Product : QpidJMS
> 2021-03-25 17:55:43,747 INFO [IO-/127.0.0.1:56012] (q.m.a.denied) - [con:901(ARP@/127.0.0.1:56012/default)] ACL-1002 : Denied : Access Virtualhost NAME=default,VIRTUALHOST_NAME=default
> 2021-03-25 17:55:43,755 INFO [Broker-Config] (q.m.c.dropped_connection) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1004 : Connection dropped
> {noformat}
> The issue seems to be quite rare: the same test ran in a loop over several
> days without triggering it.