[ https://issues.apache.org/jira/browse/QPID-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318944#comment-17318944 ]
Alex Rudyy commented on QPID-8514:
----------------------------------

Hi [~daniel.kirilyuk],

I spent some time looking into the reported issue. I believe it is caused by a defect in RingOverflowPolicyHandler. The ring overflow policy implementation tries to delete the oldest message on the queue, which is found via {{queue.getLeastSignificantOldestEntry()}}. If that message is acquired by a consumer (and its state is an instance of {{StealableConsumerAcquiredState}}), the existing delete functionality cannot steal the message (change its state). Instead, it registers a special listener so that the entry can be acquired and deleted later, after the consumer releases it. If the queue depth does not drop below the maximum allowed limit (which is the case when the maximum message depth is 1), the {{RingOverflowPolicyHandler}} keeps iterating through the same code, attempting to delete the same consumer-acquired message. This registers a duplicate StateChangeListenerEntry on every iteration, which eventually causes high CPU consumption because the linked list holding the state change listeners grows on every pass. If the consumer never settles the message (for example, because it is hung), this can become an infinite loop in RingOverflowPolicyHandler on enqueue of a new message, and the Broker will eventually run out of memory due to the growth of the StateChangeListener list.

I think we can try to resolve this issue with the following (an illustrative sketch of the idea is at the end of this message):
* change the implementation of {{queue.getLeastSignificantOldestEntry()}} so that it returns the oldest message that is not acquired by a consumer
* if no such entry is available, the {{RingOverflowPolicyHandler}} should stop the deletion attempts and break the loop

Kind Regards,
Alex

> High CPU usage and stucked connections
> --------------------------------------
>
>                 Key: QPID-8514
>                 URL: https://issues.apache.org/jira/browse/QPID-8514
>             Project: Qpid
>          Issue Type: Bug
>          Components: Broker-J
>    Affects Versions: qpid-java-broker-8.0.4
>            Reporter: Daniil Kirilyuk
>            Priority: Minor
>         Attachments: threaddump.zip
>
>
> There were around 600 connections in state CLOSE_WAIT:
> {noformat}
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep TIME_WAIT | wc -l
> 38
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep CLOSE_WAIT | wc -l
> 625
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep -v CLOSE_WAIT | grep -v TIME_WAIT | wc -l
> 7
> {noformat}
> CPU usage rises to 90% and higher.
> Connections couldn't be closed via the web GUI. The Broker log contains the message below, but the connections were not closed.
> {noformat}
> 2021-03-26 15:12:28,474 INFO [Broker-Config] (q.m.c.model_delete) - [mng:3hpU+miq(admin@/127.0.0.1:62743)] [con:751(TEST@/127.0.0.1:40422/default)] CON-1007 : Connection close initiated by operator
> {noformat}
> Thread dumps show around 49 threads with the same stack trace related to the ring queue:
> {noformat}
> java.lang.Thread.State: RUNNABLE
>     at org.apache.qpid.server.queue.QueueEntryImpl.addStateChangeListener(QueueEntryImpl.java:674)
>     at org.apache.qpid.server.queue.QueueEntryImpl.acquireOrSteal(QueueEntryImpl.java:316)
>     at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1913)
>     at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1863)
>     at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.checkOverflow(RingOverflowPolicyHandler.java:102)
>     at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.access$000(RingOverflowPolicyHandler.java:44)
>     at org.apache.qpid.server.queue.RingOverflowPolicyHandler.checkOverflow(RingOverflowPolicyHandler.java:41)
>     at org.apache.qpid.server.queue.AbstractQueue.doEnqueue(AbstractQueue.java:1382)
>     at org.apache.qpid.server.queue.AbstractQueue.enqueue(AbstractQueue.java:1290)
>     at org.apache.qpid.server.message.RoutingResult$1.postCommit(RoutingResult.java:135)
>     at org.apache.qpid.server.txn.AsyncAutoCommitTransaction$3.postCommit(AsyncAutoCommitTransaction.java:309)
>     at org.apache.qpid.server.txn.AsyncCommand.complete(AsyncCommand.java:84)
>     at org.apache.qpid.server.protocol.v1_0.StandardReceivingLinkEndpoint.receiveComplete(StandardReceivingLinkEndpoint.java:588)
>     at org.apache.qpid.server.protocol.v1_0.Session_1_0.lambda$receivedComplete$5(Session_1_0.java:1343)
>     at org.apache.qpid.server.protocol.v1_0.Session_1_0$$Lambda$142/959267941.accept(Unknown Source)
>     at java.lang.Iterable.forEach(Iterable.java:75)
>     at org.apache.qpid.server.protocol.v1_0.Session_1_0.receivedComplete(Session_1_0.java:1343)
>     at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.lambda$receivedComplete$11(AMQPConnection_1_0Impl.java:1349)
>     at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl$$Lambda$141/1774823837.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.receivedComplete(AMQPConnection_1_0Impl.java:1347)
>     at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.onReceive(AMQPConnection_1_0Impl.java:1328)
>     at org.apache.qpid.server.transport.AbstractAMQPConnection.lambda$received$2(AbstractAMQPConnection.java:576)
>     at org.apache.qpid.server.transport.AbstractAMQPConnection$$Lambda$119/6380061.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at org.apache.qpid.server.transport.AbstractAMQPConnection.received(AbstractAMQPConnection.java:571)
>     at org.apache.qpid.server.transport.MultiVersionProtocolEngine.received(MultiVersionProtocolEngine.java:135)
>     at org.apache.qpid.server.transport.NonBlockingConnection.processAmqpData(NonBlockingConnection.java:611)
>     at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:58)
>     at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:496)
>     at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:270)
>     at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:134)
>     at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:575)
>     at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:366)
>     at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:97)
>     at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:533)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
>     at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$10/1487500813.run(Unknown Source)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The ring queue also behaves strangely: the queue {{DeadLetterQueue}} has a maximum message depth of {{1}}, but it contained 55 messages in the {{Acquired}} state.
> First message in {{DeadLetterQueue}}: {{2021-03-25, 16:50:59.381 (00:00 UTC)}}.
> Last message in {{DeadLetterQueue}}: {{2021-03-25, 16:50:59.683 (00:00 UTC)}}.
> These times correlate with the drain request that timed out at 16:52:00,410, because the default drain timeout is 60 seconds:
> {noformat}
> 2021-03-25 16:52:00,408 INFO [org.apache.qpid.jms.JmsSession] (QpidJMS Connection Executor: ID:986cd915-d072-4711-bd71-50a74aab4be8:1) A JMS MessageConsumer has been closed: JmsConsumerInfo: { ID:986cd915-d072-4711-bd71-50a74aab4be8:1:1:1, destination = broadcast.XX.PRISMA_LIVERiskLimitTrigger }
> 2021-03-25 16:52:00,410 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #3 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger' - trying to recover. Cause: Remote did not respond to a drain request in time
> 2021-03-25 16:52:01,414 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #6 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger'
> {noformat}
> After the problem was triggered, the Broker started to reject client connections:
> {noformat}
> 2021-03-25 17:55:43,716 INFO [Broker-Config] (q.m.c.open) - [con:901(/127.0.0.1:56012)] CON-1001 : Open : Destination : amqp(127.0.0.1:22101) : Protocol Version : 1.0
> 2021-03-25 17:55:43,745 INFO [IO-/127.0.0.1:56012] (q.m.c.open) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1001 : Open : Destination : amqp(172.16.75.197:22101) : Protocol Version : 1.0 : Client ID : ID:af50d087-0610-406a-9792-ffa5bf6b68b4:5 : Client Version : 0.51.0 : Client Product : QpidJMS
> 2021-03-25 17:55:43,747 INFO [IO-/127.0.0.1:56012] (q.m.a.denied) - [con:901(ARP@/127.0.0.1:56012/default)] ACL-1002 : Denied : Access Virtualhost NAME=default,VIRTUALHOST_NAME=default
> 2021-03-25 17:55:43,755 INFO [Broker-Config] (q.m.c.dropped_connection) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1004 : Connection dropped
> {noformat}
> The issue seems to be quite rare: the same test ran in a loop over several days without triggering it.
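
Below is a minimal, illustrative sketch of the loop behaviour I have in mind for {{RingOverflowPolicyHandler}}. The method {{getLeastSignificantOldestAvailableEntry()}} does not exist yet and is only a placeholder for the proposed "oldest entry not acquired by a consumer" lookup; the other names are approximations of the existing Broker-J code, not exact signatures.

{noformat}
// Sketch only: while the queue is over its configured limit, delete only
// entries that are NOT consumer-acquired. If every remaining entry is
// consumer-acquired, stop instead of re-registering state change listeners
// on the same entry over and over again.
private void checkOverflow()
{
    while (isOverfull(_queue))
    {
        // Hypothetical replacement for queue.getLeastSignificantOldestEntry():
        // returns the oldest entry whose state is not a consumer-acquired
        // state, or null when no such entry exists.
        final QueueEntry oldestAvailable = _queue.getLeastSignificantOldestAvailableEntry();
        if (oldestAvailable == null)
        {
            // Nothing can be deleted right now; the depth will only drop once
            // a consumer settles or releases its messages, so break the loop
            // rather than spinning and accumulating duplicate listeners.
            break;
        }
        _queue.deleteEntry(oldestAvailable);
    }
}

private boolean isOverfull(final Queue<?> queue)
{
    // Approximation of the existing "over the limit" check: compare the
    // current depth in messages against the configured maximum (1 in the
    // reported case).
    return queue.getQueueDepthMessages() > queue.getMaximumQueueDepthMessages();
}
{noformat}

With a check along these lines, an enqueue onto a full ring queue whose remaining entries are all consumer-acquired would leave the queue temporarily over its limit instead of looping, and the overflow would be resolved on a later enqueue once an entry becomes available for deletion.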