[jira] [Work started] (AMQ-9625) Messages can become stuck on Queues

Christopher L. Shannon (Jira) Wed, 20 Nov 2024 10:41:17 -0800


     [ 
https://issues.apache.org/jira/browse/AMQ-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Work on AMQ-9625 started by Christopher L. Shannon.
---------------------------------------------------
> Messages can become stuck on Queues
> -----------------------------------
>
>                 Key: AMQ-9625
>                 URL: https://issues.apache.org/jira/browse/AMQ-9625
>             Project: ActiveMQ Classic
>          Issue Type: Bug
>    Affects Versions: 5.18.6, 6.1.4
>            Reporter: Christopher L. Shannon
>            Assignee: Christopher L. Shannon
>            Priority: Major
>             Fix For: 6.2.0, 5.18.7, 6.1.5
>
>
> For the last several years I have been occasionally seeing "stuck" messages 
> that appear on queues that will not be dispatched until after a broker 
> restart. The bug is the same as described in 
> https://issues.apache.org/jira/browse/AMQ-2955 (the root cause was never 
> figured out; it was closed because it could not be reproduced). The resulting 
> behavior seen is that KahaDB has the batch cursor set to a point after a 
> message that is stored so that message will never dispatch.
> I recently figured out how to reproduce it in a test environment and finally 
> tracked down the root cause and a fix. Not surprisingly, there are a few 
> things at play here, and the bug is a race condition, so it won't be seen 
> unless several conditions hold (and only if the broker is configured a certain 
> way).
> h3. Background:
> There are 2 optimizations that the broker uses that are playing into this and 
> both must be enabled for the issue to happen.
>  # {{useCache=true}} , The normal flow for incoming messages is that they get 
> written to the store and then they get paged off disk (same thread or another 
> thread) to be dispatched to consumers. However, there's also a message cache 
> and if enabled and if there's free memory, the message will be added to the 
> the cache after sending to disk so we don't need to re-read it off disk again 
> later when dispatching.
>  # {{concurrentStoreAndDispatchQueues=true}} The broker also has an 
> optimization for queues where it it will try and dispatch incoming messages 
> concurrently to consumers while also writing to disk. if the consumers are 
> fast enough to ack, we can cancel the disk write which saves disk IO and this 
> obviously is a benefit for slow disks. This requires the cache to be enabled 
> to be beneficial as without the cache enabled the message would not be 
> visible in the cursor to dispatch to a consumer until the write finished.
> The two settings work together. In practice, the message is submitted to the 
> store as an async task that the store queues up in the background. While the 
> task is in the queue, the message is concurrently added to the in-memory 
> cache and the broker proceeds to dispatch to consumers, which may acknowledge 
> dispatched messages before the disk write is finished if they are fast and 
> keeping up. Messages that were already written are removed as normal, but if 
> the async task has not finished it gets cancelled, which saves a disk write.
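
The store-and-dispatch flow described above can be sketched in plain Java. This is not ActiveMQ code: the class `AsyncStoreSketch` and all of its methods are hypothetical, and the background store thread is replaced by an explicit `drainWrites()` call so the interleaving stays deterministic.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;
import java.util.concurrent.FutureTask;

/** Hypothetical model of a store with queued async writes; not ActiveMQ code. */
final class AsyncStoreSketch {
    /** Messages that actually reached disk, keyed by id. */
    final Map<Long, String> disk = new TreeMap<>();
    private final Queue<FutureTask<Void>> writeQueue = new ArrayDeque<>();
    private final Map<Long, FutureTask<Void>> pending = new HashMap<>();

    /** Queue the disk write and return at once, so dispatch can proceed from the cache. */
    void addMessageAsync(long id, String body) {
        FutureTask<Void> task = new FutureTask<Void>(() -> {
            disk.put(id, body);
        }, null);
        pending.put(id, task);
        writeQueue.add(task);
    }

    /**
     * Consumer ack. If the write is still queued it is cancelled (returns true,
     * a disk write was saved); otherwise the message is removed from disk.
     */
    boolean ack(long id) {
        FutureTask<Void> task = pending.remove(id);
        if (task != null && task.cancel(false)) {
            return true;
        }
        disk.remove(id);
        return false;
    }

    /** Stand-in for the background store thread: run every queued write. */
    void drainWrites() {
        FutureTask<Void> task;
        while ((task = writeQueue.poll()) != null) {
            task.run(); // run() does nothing for a task that was cancelled
        }
    }
}
```

In this model, acking a message before `drainWrites()` runs cancels its task and the message never touches `disk`; acking it afterwards falls through to the normal removal path.
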
> h3. Bug description:
> When the broker runs out of memory to cache messages, the cache has to be 
> [disabled|https://github.com/apache/activemq/blob/3400983a22284a28a8989d4b0aaf762090b0911a/activemq-broker/src/main/java/org/apache/activemq/broker/region/cursors/AbstractStoreCursor.java#L258].
>  As part of this process the cache has to 
> [tell|https://github.com/apache/activemq/blob/3400983a22284a28a8989d4b0aaf762090b0911a/activemq-broker/src/main/java/org/apache/activemq/broker/region/cursors/AbstractStoreCursor.java#L336]
>  the store what the last message is that was cached so that when the cache is 
> exhausted we can resume paging off disk and dispatching in the correct spot.
> The process for disabling the cache starts when a new incoming message is 
> attempted to be added to the cache and it detects that memory is full. When 
> this happens the process for disabling and syncing to the store starts and 
> the cache goes through and makes sure any previously cached messages that may 
> be pending to be written are completed (either acked and cancelled or written 
> to disk and completed) and after that will tell the store where to resume, 
> which would be after the last cached message. When the cache is disabled, new 
> writes should no longer be async because we need to have the messages written 
> to disk to be dispatched.
> In theory, because the store was told the last cached message, the new 
> incoming message that triggered the disabling/sync would be eventually paged 
> off disk and dispatched. However, there is a race condition bug and what is 
> actually happening is sometimes the new incoming message has not completed 
> the write to the store when the queue goes to fill the next batch to 
> dispatch, so it gets missed as it's still pending. In this case the message 
> that triggered the disabling/sync was submitted to the cache but never 
> actually cached because memory was full, and then the dispatch continues and 
> proceeds before guaranteeing the write.
> The end result is that if the consumer is very fast, when the store goes to 
> page off the next messages it may not see the pending write that hasn't 
> finished (and was not added to the cache) so the store skips ahead before the 
> incoming message completes. By the time the incoming message is finished 
> writing to the store, the disk cursor has moved past it and the message will 
> be skipped and gets stuck until the broker restarts to reset the batch.
> This can happen on every cache disable/enable cycle, which is why you might 
> see one stuck message or several.
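
The interleaving described above can be replayed step by step in a small single-threaded model. This is a simplification under stated assumptions, not ActiveMQ or KahaDB code: `StuckMessageModel`, `setBatch`, and `fillBatch` are hypothetical names, and the model assumes the cursor is carried forward by a later message (sequence 4) that reaches disk before the pending write for the triggering message (sequence 3). The description does not spell out that detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical single-threaded model of the race; not ActiveMQ code. */
final class StuckMessageModel {
    private final TreeMap<Long, String> disk = new TreeMap<>(); // sequence -> body
    private long batchCursor = 0; // last sequence the store has paged in

    void write(long seq, String body) {
        disk.put(seq, body);
    }

    /** Cache disabled: resume paging after the last cached message. */
    void setBatch(long lastCachedSeq) {
        batchCursor = lastCachedSeq;
    }

    /** Page in everything on disk after the cursor, then advance the cursor. */
    List<String> fillBatch() {
        List<String> batch = new ArrayList<>(disk.tailMap(batchCursor, false).values());
        if (!disk.isEmpty()) {
            batchCursor = Math.max(batchCursor, disk.lastKey());
        }
        return batch;
    }

    /** Replays the buggy ordering and returns what was dispatched. */
    static List<String> replayRace() {
        StuckMessageModel store = new StuckMessageModel();
        List<String> dispatched = new ArrayList<>();

        // Messages 1 and 2 were served from the cache. Message 3 arrives, memory
        // is full, the cache is disabled and the store is told to resume after 2.
        store.setBatch(2);
        // Message 3's write is still pending. Assumption: message 4 arrives with
        // the cache off and is written to disk before message 3's write completes.
        store.write(4, "m4");
        dispatched.addAll(store.fillBatch()); // sees only m4, cursor moves to 4
        store.write(3, "m3");                 // pending write finally completes
        dispatched.addAll(store.fillBatch()); // nothing after 4, m3 is skipped
        return dispatched;
    }

    public static void main(String[] args) {
        System.out.println(replayRace()); // prints [m4]
    }
}
```

Message `m3` is on disk at the end of the replay but sits behind the cursor, so in this model it is never dispatched until the cursor is reset, which matches the stuck-message symptom.
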
> h3. Solution:
> The solution is actually rather simple and only a couple of lines of code. Because 
> the incoming message that was attempted to be added to the cache was not 
> added to the cache (because memory was full), we just need to wait for that 
> message that triggered the disable and sync to finish its task of writing to 
> the store so that it will be visible when reading in the next batch and 
> paging off disk. This guarantees the message won't be missed, so there are no 
> more stuck messages.
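
The idea of waiting for the triggering message's write can be illustrated with `java.util.concurrent`. This is a minimal sketch, not the actual patch: the class `WaitForTriggerWriteSketch` and the method `nextBatch` are hypothetical, the 200 ms sleep stands in for a slow disk write, and sequence 3 is an arbitrary id for the message that triggered the cache disable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of the fix; names do not come from the ActiveMQ code base. */
final class WaitForTriggerWriteSketch {

    /**
     * Simulates one cache-disable event and returns the sequence ids that the
     * next batch read finds on disk.
     */
    static List<Long> nextBatch(boolean waitForTriggerWrite) {
        ExecutorService storeThread = Executors.newSingleThreadExecutor();
        ConcurrentSkipListSet<Long> disk = new ConcurrentSkipListSet<>();
        try {
            // Message 3 triggered the cache disable: it was not cached, and its
            // store write is still an async task (the sleep models slow disk IO).
            Future<?> triggerWrite = storeThread.submit(() -> {
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                disk.add(3L);
            });

            if (waitForTriggerWrite) {
                triggerWrite.get(); // the fix: block until the write has completed
            }
            // Fill the next batch from whatever is on disk right now.
            return new ArrayList<>(disk);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            storeThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("without wait: " + nextBatch(false));
        System.out.println("with wait:    " + nextBatch(true));
    }
}
```

With `waitForTriggerWrite` set, the batch read always sees message 3. Without it, the read normally runs before the simulated write finishes and the message is missed, which is the window the fix closes.
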



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact
