[
https://issues.apache.org/jira/browse/ARTEMIS-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799520#comment-17799520
]
ASF subversion and git services commented on ARTEMIS-4522:
----------------------------------------------------------
Commit 9baad30827c10595b94868807c9d1a14b1125824 in activemq-artemis's branch
refs/heads/main from Justin Bertram
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=9baad30827 ]
ARTEMIS-4522 simplify test & complete implementation
> Slow-Consumer-Handling may get delayed significantly on occupied pool
> ---------------------------------------------------------------------
>
> Key: ARTEMIS-4522
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4522
> Project: ActiveMQ Artemis
> Issue Type: Improvement
> Reporter: Rico Neubauer
> Priority: Major
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Slack discussion with [~jbertram] see here:
> [https://the-asf.slack.com/archives/CFL910J30/p1699520764243169]
> h2. Initial Description
> consumers-window-size is set to 0, and there are some heavily used queues
> whose attached consumers allow for a concurrency of 500 messages
> (larger than thread-pool-client-size).
> After some time - it is unclear what triggers it - the behavior degrades:
> only 1-5 messages are fetched from those queues and processed immediately,
> followed by a pause of 10 seconds before the next 1-5 messages are
> fetched. This pattern repeats.
> In the thread dumps, several threads like the following can be seen:
> {noformat}
> "Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 runnable [0x00007f8fdbae5000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park(java.base/Native Method)
> - parking to wait for <0x00000003f6c9d1d0> (a org.apache.activemq.artemis.utils.AbstractLatch$CountSync)
> at java.util.concurrent.locks.LockSupport.parkNanos(java.base/LockSupport.java:252)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:717)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base/AbstractQueuedSynchronizer.java:1074)
> at org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115)
> at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869)
> at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025)
> at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154)
> at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
> at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
> at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
> at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1136)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:635)
> at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> {noformat}
> which corresponds to the code location with a 10-second timed wait here:
> [https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854|https://github.com/seeburger-ag/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854]
> It seems it always has to wait until the full 10 seconds have elapsed, which
> is unexpected, since the latch should count down earlier, right after the
> credit has been sent to the server. One could suspect a bug that increments
> the latch but fails to decrement it, so that it can never reach 0 again, but
> that is speculative.
> h2. Evaluation
> See Slack for details. It boils down to the thread pool in use running out of
> free threads, so the slow-consumer handling cannot run in good time.
> Affected versions: all since at least 2.22.0 up to the current one - probably
> since "ever", as the code involved derives from HornetQ.
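The starvation described in the Evaluation can be reproduced in a simplified model. The sketch below is not Artemis code: the class and method names are hypothetical, a plain java.util.concurrent.CountDownLatch stands in for the Artemis latch, and a short timeout stands in for the 10-second wait. It only assumes what the description states, namely that the task which should release the latch is queued on the same pool as the task waiting for it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the reported behavior, not the Artemis implementation.
class PoolStarvationSketch {

    // Returns true if the latch was released before the timeout expired.
    static boolean awaitCreditOnSharedPool(int poolSize, long timeoutMillis) {
        ExecutorService sharedPool = Executors.newFixedThreadPool(poolSize);
        try {
            CountDownLatch creditSent = new CountDownLatch(1);
            Future<Boolean> result = sharedPool.submit(() -> {
                // Stand-in for "send credit": queued on the SAME pool as this task.
                sharedPool.execute(creditSent::countDown);
                // Stand-in for the timed wait seen in the thread dump.
                return creditSent.await(timeoutMillis, TimeUnit.MILLISECONDS);
            });
            return result.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            sharedPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // One thread: the waiting task occupies it, so the countdown cannot run in time.
        System.out.println("pool of 1 released in time: " + awaitCreditOnSharedPool(1, 500));
        // Two threads: a free thread is available to run the countdown.
        System.out.println("pool of 2 released in time: " + awaitCreditOnSharedPool(2, 500));
    }
}
```

With a single pool thread the wait should always run to the full timeout, which matches the fixed pause observed between message batches; with a free thread available the latch is released almost immediately.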
> h2. Possible Solutions
> We experimented a bit and found a solution that works for us - actually two
> variants.
> * No. 1 simply uses a dedicated single-threaded executor for flow control,
> whose only job is to guarantee that the credit has been sent before continuing
> with other operations when starting/resetting slow consumers.
> This looks simple and works perfectly in the test case as well as in the real
> product under stress tests:
> [https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e]
> * No. 2 follows a similar idea, but goes all the way and adds a dedicated
> thread pool for flow control, set up like the existing ones.
> This fits better with the style of the existing thread pools and makes it
> configurable.
> [https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0]
> and [~jbertram] said:
> {quote}I think either one would be fine. Create a Jira and send a PR.
> {quote}
> So here is the Jira and I will push No. 2 as a pull-request:
> https://github.com/apache/activemq-artemis/pull/4699
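The idea behind variant No. 1 in the description (a dedicated single-threaded executor whose only job is flow control) can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the code from the linked commits or the pull request: the class name, method name, and executor variables are hypothetical, and a plain CountDownLatch stands in for the Artemis latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the "dedicated flow-control executor" idea.
class DedicatedFlowControlSketch {

    // Returns true if the latch was released before the timeout expired.
    static boolean awaitCreditWithDedicatedExecutor(long timeoutMillis) {
        // Work pool with a single thread, fully occupied by the waiting task below.
        ExecutorService workPool = Executors.newFixedThreadPool(1);
        // Separate executor reserved for flow control (an assumption of this sketch).
        ExecutorService flowControl = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch creditSent = new CountDownLatch(1);
            Future<Boolean> result = workPool.submit(() -> {
                // Stand-in for "send credit": runs on the dedicated executor,
                // so it does not compete with the saturated work pool.
                flowControl.execute(creditSent::countDown);
                return creditSent.await(timeoutMillis, TimeUnit.MILLISECONDS);
            });
            return result.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            workPool.shutdownNow();
            flowControl.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("released in time: " + awaitCreditWithDedicatedExecutor(2000));
    }
}
```

Because the countdown no longer depends on a free thread in the work pool, the wait should end as soon as the flow-control task has run, even when the work pool is saturated. Variant No. 2 would presumably differ mainly in using a configurable pool instead of a single thread.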
--
This message was sent by Atlassian Jira
(v8.20.10#820010)