[ 
https://issues.apache.org/jira/browse/ARTEMIS-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justin Bertram resolved ARTEMIS-4522.
-------------------------------------
    Fix Version/s: 2.32.0
       Resolution: Fixed

> Flow control may get delayed significantly on occupied pool
> -----------------------------------------------------------
>
>                 Key: ARTEMIS-4522
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4522
>             Project: ActiveMQ Artemis
>          Issue Type: Improvement
>            Reporter: Rico Neubauer
>            Priority: Major
>             Fix For: 2.32.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Slack discussion with [~jbertram]; see here:
> [https://the-asf.slack.com/archives/CFL910J30/p1699520764243169]
> h2. Initial Description
> consumer-window-size is set to 0 and there are some heavily used queues with 
> consumers attached that allow for a concurrency of 500 messages (larger than 
> the client thread-pool size).
> After some time - it is unclear what triggers it - the behavior degrades: 
> only 1-5 messages are fetched from those queues and processed immediately, 
> but then there is a pause of 10 seconds before the next 1-5 messages are 
> fetched. This continues.
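>
> For context, a consumer window size of 0 means the client does not buffer 
> messages and has to request flow-control credit from the broker for every 
> single message. Roughly how that is set on the core client (the URL and 
> class name below are just illustrative, not taken from this report):
> {code:java}
> import org.apache.activemq.artemis.api.core.client.ActiveMQClient;
> import org.apache.activemq.artemis.api.core.client.ServerLocator;
>
> public class WindowSizeZeroExample {
>    public static void main(String[] args) throws Exception {
>       // consumerWindowSize = 0 disables client-side buffering: the consumer
>       // asks the broker for credit per message.
>       ServerLocator locator = ActiveMQClient.createServerLocator("tcp://localhost:61616");
>       locator.setConsumerWindowSize(0);
>       // ... create the session factory, sessions and consumers as usual ...
>       locator.close();
>    }
> }
> {code}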
> In the thread dumps, several stack traces like the following can be seen:
> {noformat}
> "Thread-0 (ActiveMQ-client-global-threads)" #640 daemon prio=5 os_prio=0 
> cpu=899719.36ms elapsed=354425.44s tid=0x00007f904c02e850 nid=0x150d29 
> runnable  [0x00007f8fdbae5000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>       at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method)
>       - parking to wait for  <0x00000003f6c9d1d0> (a 
> org.apache.activemq.artemis.utils.AbstractLatch$CountSync)
>       at 
> java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.5/LockSupport.java:252)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.5/AbstractQueuedSynchronizer.java:717)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@17.0.5/AbstractQueuedSynchronizer.java:1074)
>       at 
> org.apache.activemq.artemis.utils.AbstractLatch.await(AbstractLatch.java:115)
>       at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.startSlowConsumer(ClientConsumerImpl.java:869)
>       at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1025)
>       at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1154)
>       at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>       at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>       at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>       at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$2008/0x00000007ce20c000.run(Unknown
>  Source)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.5/ThreadPoolExecutor.java:1136)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.5/ThreadPoolExecutor.java:635)
>       at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> {noformat}
> which corresponds to the code location with the 10-second wait here:
> [https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientConsumerImpl.java#L854]
> It seems the code always has to wait until the full 10 seconds are reached, 
> which is unexpected, since the latch should be counted down earlier, right 
> after the credit has been sent to the server. One could suspect a bug that 
> increments the latch but misses the decrement, so it can never reach 0 
> again, but that is speculative.
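>
> To illustrate the mechanism (a simplified sketch with made-up names, not the 
> actual ClientConsumerImpl code): the credit-sending task is queued on the 
> same shared client pool that is already busy running consumer callbacks, so 
> the latch is only counted down once a pool thread becomes free - and until 
> then the caller sits in the 10-second await:
> {code:java}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class FlowControlLatchSketch {
>    public static void main(String[] args) throws Exception {
>       ExecutorService sharedClientPool = Executors.newFixedThreadPool(1);
>       // Occupy the only pool thread, like long-running onMessage callbacks do.
>       sharedClientPool.execute(() -> sleep(15_000));
>
>       CountDownLatch creditSent = new CountDownLatch(1);
>       sharedClientPool.execute(() -> {
>          // sendCredits(...) would go here (placeholder for the real call).
>          creditSent.countDown();
>       });
>
>       long start = System.nanoTime();
>       creditSent.await(10, TimeUnit.SECONDS); // pool never ran the task -> full timeout
>       System.out.printf("waited %d ms%n", (System.nanoTime() - start) / 1_000_000);
>       sharedClientPool.shutdownNow();
>    }
>
>    private static void sleep(long ms) {
>       try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
>    }
> }
> {code}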
> h2. Evaluation
> See the Slack thread for details; it boils down to the thread pool in use 
> running out of free threads, so the slow-consumer handling cannot run in 
> good time.
> Affected versions: all since at least 2.22.0 up to current - probably since 
> "ever", as the involved code derives from HornetQ.
> h2. Possible Solutions
> We experimented a bit and found a solution for us - or actually two variants 
> (a minimal sketch of the shared idea follows at the end of this description).
>  * No. 1 simply uses a dedicated single-threaded executor for flow control, 
> whose only job is to guarantee that the credit has been sent before 
> continuing with other operations when starting/resetting slow consumers.
> This is simple and works perfectly in the test case as well as in the real 
> product under stress tests:
> [https://github.com/seeburger-ag/activemq-artemis/commit/5b94560fc55fabcc15f138fbec6b2f855508cc7e]
>  * No. 2 follows a similar idea, but goes the full way of adding a dedicated 
> thread pool for flow control and wiring it in like the existing ones.
> This fits the style of the existing thread pools better and makes the new 
> pool configurable:
> [https://github.com/seeburger-ag/activemq-artemis/commit/b4d558c5fc90f3a442391ebcf4976e8d6ec7adb0]
> and [~jbertram] said:
> {quote}I think either one would be fine. Create a Jira and send a PR.
> {quote}
> So here is the Jira, and I will push No. 2 as a pull request: 
> https://github.com/apache/activemq-artemis/pull/4699
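>
> For illustration only (the actual changes are in the commits and the PR 
> linked above), the idea shared by both variants, again with made-up names, 
> is to give flow control its own executor so that sending credit never 
> queues behind busy consumer callbacks:
> {code:java}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> public class DedicatedFlowControlExecutorSketch {
>    // Dedicated executor whose only job is flow-control work.
>    private final ExecutorService flowControlExecutor = Executors.newSingleThreadExecutor();
>
>    public void startSlowConsumer() throws InterruptedException {
>       CountDownLatch creditSent = new CountDownLatch(1);
>       flowControlExecutor.execute(() -> {
>          // sendCredits(...) would go here (placeholder for the real call).
>          creditSent.countDown();
>       });
>       // Returns almost immediately because the dedicated thread is free.
>       creditSent.await(10, TimeUnit.SECONDS);
>    }
> }
> {code}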



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
