[ https://issues.apache.org/jira/browse/ARTEMIS-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Justin Bertram resolved ARTEMIS-4522.
-------------------------------------
    Fix Version/s: 2.32.0
       Resolution: Fixed

> Flow control may get delayed significantly on occupied pool
> -----------------------------------------------------------
>
>                 Key: ARTEMIS-4522
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4522
>             Project: ActiveMQ Artemis
>          Issue Type: Improvement
>            Reporter: Rico Neubauer
>            Priority: Major
>             Fix For: 2.32.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Slack discussion with [~jbertram], see here:
> [https://the-asf.slack.com/archives/CFL910J30/p1699520764243169]
>
> h2. Initial Description
>
> consumers-window-size is set to 0, and there are some queues that are
> heavily used and have consumers attached that allow for a concurrency of
> 500 messages (larger than thread-pool-client-size).
> After some time - it is unclear what triggers it - the behavior degrades:
> only 1-5 messages are fetched from those queues and immediately processed,
> but then there is a pause of 10 seconds before the next 1-5 messages are
> fetched. This continues.
> In the thread-dumps, several like the following can be seen:
> {noformat}
> "Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 runnable [0x00007f8fdbae5000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
> 	at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method)
> 	- parking to wait for <0x00000003f6c9d1d0> (a org.apache.activemq.artemis.utils.AbstractLatch$CountSync)
> 	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074)
> 	at org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115)
> 	at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869)
> 	at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025)
> 	at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154)
> 	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> 	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> 	at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> 	at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown Source)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635)
> 	at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> {noformat}
> which corresponds to the code location with a 10-second wait here:
> [https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854]
> It seems it always needs to wait until the full 10 seconds have elapsed,
> which is unexpected, since the latch should count down earlier, right after
> the credit has been sent to the server. One could suspect a bug that
> increments the latch but misses a decrement, so it can never reach 0 again,
> but that is speculative.
>
> h2. Evaluation
>
> See Slack for details; it boils down to the thread pool in use running out
> of free threads, so the slow-consumer handling is not able to run in good
> time.
> Affected versions: all since at least 2.22.0 up to current - probably since
> "ever", as the involved code derives from HornetQ.
>
> h2. Possible Solutions
>
> We experimented a bit and found a solution for us - actually two variants.
> * No. 1 just uses its own single-threaded executor for flow control, whose
> only job is to guarantee the credit was sent before continuing with other
> operations when starting/resetting slow consumers.
> This looks simple and works perfectly in the test case as well as in the
> real product under stress tests:
> [https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e]
> * No. 2 has a similar idea, but goes the full way of adding its own thread
> pool for flow control and wiring it up like the others.
> This fits better into the style of the existing thread pools and allows
> configuring it:
> [https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0]
> and [~jbertram] said:
> {quote}I think either one would be fine. Create a Jira and send a PR.
> {quote}
> So here is the Jira, and I will push No. 2 as a pull request:
> https://github.com/apache/activemq-artemis/pull/4699

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
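The mechanism identified in the Evaluation - the credit-sending task queued on the same saturated client pool as the message handling, so the awaiting thread only resumes when the pool frees up or the timeout is hit - can be reproduced in miniature. The following is a self-contained sketch, not Artemis code: the pool size, delays, and the shortened timeout (2 s instead of Artemis's 10 s) are made up for the demo.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Miniature of the reported problem: the task that counts the latch down
// (standing in for "send credit to the server") is queued on the same
// saturated pool that the consumer work runs on, so await() blocks until
// the occupying task finishes - or until the timeout, whichever is first.
public class PoolStarvationDemo {

    /** Returns how many milliseconds await() blocked. */
    static long runDemo() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1); // "occupied" pool
        CountDownLatch creditSent = new CountDownLatch(1);

        // Occupy the only pool thread, standing in for busy message handling.
        pool.submit(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The "credit" task has to wait in line behind the busy thread.
        pool.submit(creditSent::countDown);

        long start = System.nanoTime();
        // Artemis waits up to 10 seconds in startSlowConsumer; shortened here.
        creditSent.await(2, TimeUnit.SECONDS);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        pool.shutdown();
        return elapsedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("await() blocked for ~" + runDemo() + " ms");
    }
}
```

With a pool of 500 genuinely busy consumer threads instead of one short sleeper, the same queueing effect makes the client wait out the full 10-second timeout, matching the 10-second pauses in the description.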
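For reference, the idea behind variant No. 1 can be sketched as follows. This is a simplified illustration of the approach described above, not the linked commit; the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of variant No. 1 (names hypothetical): a dedicated single-threaded
// executor whose only job is sending flow-control credits, so a credit task
// can never queue behind message-handling work on the shared client pool.
public class FlowControlExecutorSketch {

    private final ExecutorService flowControlExecutor =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "flow-control");
                t.setDaemon(true);
                return t;
            });

    /**
     * Runs the credit-sending task on the dedicated thread and blocks until
     * it has completed, guaranteeing the credit was sent before continuing.
     */
    public void sendCreditsAndWait(Runnable sendCredits)
            throws InterruptedException, ExecutionException, TimeoutException {
        flowControlExecutor.submit(sendCredits).get(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        FlowControlExecutorSketch sketch = new FlowControlExecutorSketch();
        sketch.sendCreditsAndWait(() -> System.out.println("credit sent"));
    }
}
```

Because the executor has exactly one thread that does nothing else, the wait is bounded by the credit send itself rather than by contention on the shared pool, which is what removes the 10-second stalls.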