[
https://issues.apache.org/jira/browse/QPID-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17318944#comment-17318944
]
Alex Rudyy edited comment on QPID-8514 at 4/11/21, 10:29 PM:
-------------------------------------------------------------
Hi [~daniel.kirilyuk]
I spent some time looking into the reported issue. I believe it is caused by a
defect in {{RingOverflowPolicyHandler}}.
The ring overflow policy implementation tries to delete the oldest message on
the queue, which is found via {{queue.getLeastSignificantOldestEntry()}}. If
that message is acquired by a consumer (and its state is not an instance of
{{StealableConsumerAcquiredState}}), the existing delete functionality cannot
steal the message (change its state). Instead, it registers a special listener
in order to acquire and delete the message later, after the consumer releases
it. If either of the queue depths does not drop below its maximum allowed limit
(which is the case when the maximum message depth is 1), the
{{RingOverflowPolicyHandler}} keeps iterating through the same code, attempting
to delete the same consumer-acquired message. This registers a duplicate
{{StateChangeListenerEntry}} on every loop iteration. That eventually causes
high CPU consumption, as the linked list keeping the state change listeners
grows on every iteration.
If the consumer never settles the message (for example, because it is hung),
this can become an infinite loop in {{RingOverflowPolicyHandler}} on enqueue of
a new message. Eventually, the Broker will run out of memory due to the growth
of the {{StateChangeListener}} list.
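To illustrate the pattern, here is a minimal, self-contained model of the loop
described above. It is not the actual Broker-J code; the class and field names
are made up for illustration only.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified model of the behaviour described above, NOT the real Broker-J code.
// Entry stands in for a queue entry; the loop mimics the overflow handler
// retrying deletion of the same consumer-acquired oldest entry.
public class RingOverflowLoopSketch
{
    static class Entry
    {
        boolean acquiredByConsumer = true;                        // consumer holds the entry
        final List<Runnable> stateChangeListeners = new ArrayList<>();

        boolean acquireOrSteal(Runnable listener)
        {
            if (acquiredByConsumer)
            {
                // cannot steal the entry: register a listener to delete it later
                stateChangeListeners.add(listener);
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args)
    {
        final Entry oldest = new Entry();
        final int maxDepth = 1;
        int queueDepth = 2;                                       // already above the limit
        int iterations = 0;

        // The depth never drops below the limit, so the handler keeps picking the
        // same acquired entry and registers yet another listener on every pass
        // (bounded here only so the example terminates).
        while (queueDepth > maxDepth && iterations < 1_000_000)
        {
            oldest.acquireOrSteal(() -> { /* delete on release */ });
            iterations++;
        }

        System.out.println("duplicate listeners registered: "
                + oldest.stateChangeListeners.size());            // prints 1000000
    }
}
{code}
Every pass through the loop adds one more duplicate listener to the same entry,
which is where the CPU time and the memory go.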
I think we can try to resolve this issue with the following (a sketch of the
intended control flow follows the list):
* change the implementation of {{queue.getLeastSignificantOldestEntry()}} to
return the oldest entry that is not acquired by a consumer
* if no such entry is available, the {{RingOverflowPolicyHandler}} should stop
the deletion attempts and break the loop
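A rough, self-contained sketch of that control flow (the class and method names
below are illustrative stand-ins, not the actual Broker-J API):
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the proposed control flow only; the names below are illustrative
// stand-ins, not the actual Broker-J classes or methods.
public class RingOverflowFixSketch
{
    static class Entry
    {
        final boolean acquiredByConsumer;

        Entry(boolean acquiredByConsumer)
        {
            this.acquiredByConsumer = acquiredByConsumer;
        }
    }

    // Stands in for queue.getLeastSignificantOldestEntry() after the proposed
    // change: return the oldest entry that is NOT consumer acquired, or null.
    static Entry leastSignificantOldestEntry(Deque<Entry> queue)
    {
        return queue.stream()
                    .filter(entry -> !entry.acquiredByConsumer)
                    .findFirst()
                    .orElse(null);
    }

    public static void main(String[] args)
    {
        final Deque<Entry> queue = new ArrayDeque<>();
        queue.add(new Entry(true));      // acquired by a consumer, cannot be stolen
        queue.add(new Entry(true));      // acquired by a consumer, cannot be stolen
        final int maxDepth = 1;

        while (queue.size() > maxDepth)
        {
            final Entry oldest = leastSignificantOldestEntry(queue);
            if (oldest == null)
            {
                // Nothing deletable right now: stop retrying instead of spinning
                // and registering duplicate state change listeners.
                break;
            }
            queue.remove(oldest);
        }

        System.out.println("remaining depth: " + queue.size());   // 2: the loop exits instead of spinning
    }
}
{code}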
Kind Regards,
Alex
> High CPU usage and stuck connections
> ------------------------------------
>
> Key: QPID-8514
> URL: https://issues.apache.org/jira/browse/QPID-8514
> Project: Qpid
> Issue Type: Bug
> Components: Broker-J
> Affects Versions: qpid-java-broker-8.0.4
> Reporter: Daniil Kirilyuk
> Priority: Minor
> Attachments: RingPolicyConcurrencyTest.java, threaddump.zip
>
>
> There were around 600 connections in state CLOSE_WAIT:
> {noformat}
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep TIME_WAIT | wc -l
> 38
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep CLOSE_WAIT | wc -l
> 625
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep -v CLOSE_WAIT | grep -v TIME_WAIT | wc -l
> 7
> {noformat}
> CPU usage rises to 90% and higher.
> Connections could not be closed via the web GUI. The broker log contains the
> message below, but the connections were not closed.
> {noformat}
> 2021-03-26 15:12:28,474 INFO [Broker-Config] (q.m.c.model_delete) - [mng:3hpU+miq(admin@/127.0.0.1:62743)] [con:751(TEST@/127.0.0.1:40422/default)] CON-1007 : Connection close initiated by operator
> {noformat}
> Thread dumps show around 49 threads with the same stack trace related to the
> ring queue:
> {noformat}
> java.lang.Thread.State: RUNNABLE
> at org.apache.qpid.server.queue.QueueEntryImpl.addStateChangeListener(QueueEntryImpl.java:674)
> at org.apache.qpid.server.queue.QueueEntryImpl.acquireOrSteal(QueueEntryImpl.java:316)
> at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1913)
> at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1863)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.checkOverflow(RingOverflowPolicyHandler.java:102)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.access$000(RingOverflowPolicyHandler.java:44)
> at org.apache.qpid.server.queue.RingOverflowPolicyHandler.checkOverflow(RingOverflowPolicyHandler.java:41)
> at org.apache.qpid.server.queue.AbstractQueue.doEnqueue(AbstractQueue.java:1382)
> at org.apache.qpid.server.queue.AbstractQueue.enqueue(AbstractQueue.java:1290)
> at org.apache.qpid.server.message.RoutingResult$1.postCommit(RoutingResult.java:135)
> at org.apache.qpid.server.txn.AsyncAutoCommitTransaction$3.postCommit(AsyncAutoCommitTransaction.java:309)
> at org.apache.qpid.server.txn.AsyncCommand.complete(AsyncCommand.java:84)
> at org.apache.qpid.server.protocol.v1_0.StandardReceivingLinkEndpoint.receiveComplete(StandardReceivingLinkEndpoint.java:588)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0.lambda$receivedComplete$5(Session_1_0.java:1343)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0$$Lambda$142/959267941.accept(Unknown Source)
> at java.lang.Iterable.forEach(Iterable.java:75)
> at org.apache.qpid.server.protocol.v1_0.Session_1_0.receivedComplete(Session_1_0.java:1343)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.lambda$receivedComplete$11(AMQPConnection_1_0Impl.java:1349)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl$$Lambda$141/1774823837.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.receivedComplete(AMQPConnection_1_0Impl.java:1347)
> at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.onReceive(AMQPConnection_1_0Impl.java:1328)
> at org.apache.qpid.server.transport.AbstractAMQPConnection.lambda$received$2(AbstractAMQPConnection.java:576)
> at org.apache.qpid.server.transport.AbstractAMQPConnection$$Lambda$119/6380061.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.qpid.server.transport.AbstractAMQPConnection.received(AbstractAMQPConnection.java:571)
> at org.apache.qpid.server.transport.MultiVersionProtocolEngine.received(MultiVersionProtocolEngine.java:135)
> at org.apache.qpid.server.transport.NonBlockingConnection.processAmqpData(NonBlockingConnection.java:611)
> at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:58)
> at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:496)
> at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:270)
> at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:134)
> at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:575)
> at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:366)
> at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:97)
> at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:533)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
> at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$10/1487500813.run(Unknown Source)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Also, the ring queue behaves strangely: the queue {{DeadLetterQueue}} has its
> maximum message depth set to {{1}}, but there were 55 messages on it in the
> {{Acquired}} state.
> First message in {{DeadLetterQueue}}: {{2021-03-25, 16:50:59.381 (00:00 UTC)}}.
> Last message in {{DeadLetterQueue}}: {{2021-03-25, 16:50:59.683 (00:00 UTC)}}.
> These times correlate with the drain request that timed out at 16:52:00,410,
> because the default drain request timeout is 60 seconds:
> {noformat}
> 2021-03-25 16:52:00,408 INFO [org.apache.qpid.jms.JmsSession] (QpidJMS Connection Executor: ID:986cd915-d072-4711-bd71-50a74aab4be8:1) A JMS MessageConsumer has been closed: JmsConsumerInfo: { ID:986cd915-d072-4711-bd71-50a74aab4be8:1:1:1, destination = broadcast.XX.PRISMA_LIVERiskLimitTrigger }
> 2021-03-25 16:52:00,410 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #3 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger' - trying to recover. Cause: Remote did not respond to a drain request in time
> 2021-03-25 16:52:01,414 WARN [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #6 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger'
> {noformat}
> After the problem was triggered, the broker started to reject client connections:
> {noformat}
> 2021-03-25 17:55:43,716 INFO [Broker-Config] (q.m.c.open) - [con:901(/127.0.0.1:56012)] CON-1001 : Open : Destination : amqp(127.0.0.1:22101) : Protocol Version : 1.0
> 2021-03-25 17:55:43,745 INFO [IO-/127.0.0.1:56012] (q.m.c.open) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1001 : Open : Destination : amqp(172.16.75.197:22101) : Protocol Version : 1.0 : Client ID : ID:af50d087-0610-406a-9792-ffa5bf6b68b4:5 : Client Version : 0.51.0 : Client Product : QpidJMS
> 2021-03-25 17:55:43,747 INFO [IO-/127.0.0.1:56012] (q.m.a.denied) - [con:901(ARP@/127.0.0.1:56012/default)] ACL-1002 : Denied : Access Virtualhost NAME=default,VIRTUALHOST_NAME=default
> 2021-03-25 17:55:43,755 INFO [Broker-Config] (q.m.c.dropped_connection) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1004 : Connection dropped
> {noformat}
> The issue seems to be quite rare: the same test ran in a loop over several
> days without triggering it.