Christopher L. Shannon created AMQ-9625:
-------------------------------------------

             Summary: Messages can become stuck on Queues
                 Key: AMQ-9625
                 URL: https://issues.apache.org/jira/browse/AMQ-9625
             Project: ActiveMQ Classic
          Issue Type: Bug
    Affects Versions: 6.1.4, 5.18.6
            Reporter: Christopher L. Shannon
            Assignee: Christopher L. Shannon
             Fix For: 6.2.0, 5.18.7, 6.1.5


For the last several years I have occasionally seen "stuck" messages on queues 
that will not be dispatched until after a broker restart. The bug is the same 
as described in https://issues.apache.org/jira/browse/AMQ-2955 (the root cause 
was never identified; that issue was closed because it could not be 
reproduced). The resulting behavior is that KahaDB's batch cursor is set to a 
point past a stored message, so that message will never dispatch.

I recently figured out how to reproduce it in a test environment and finally 
tracked down the root cause and a fix. Not surprisingly, there are a few 
things at play here: the bug is a race condition, so it won't be seen unless 
several conditions hold (and only if the broker is configured a certain way).


Background: 

There are two broker optimizations at play here, and both must be enabled for 
the issue to happen.


1) useCache=true - The normal flow for incoming messages is that they are 
written to the store and then paged off disk (on the same thread or another 
thread) to be dispatched to consumers. However, there is also a message cache: 
if it is enabled and there is free memory, the message is added to the cache 
after being sent to disk so it does not need to be re-read from disk later 
when dispatching.
2) concurrentStoreAndDispatchQueues=true - The broker also has an optimization 
for queues where it will try to dispatch incoming messages to consumers 
concurrently while also writing to disk. If the consumers are fast enough to 
ack, the disk write can be cancelled, which saves disk IO and is an obvious 
benefit for slow disks. This requires the cache to be enabled to be 
beneficial, as without the cache the message would not be visible in the 
cursor to dispatch to a consumer until the write finished.
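For reference, both settings are configured in the broker's activemq.xml. A 
minimal sketch of the relevant fragment (the attribute names are the real 
ones; the surrounding layout is abbreviated, and note both default to true):

```xml
<broker xmlns="http://activemq.apache.org/schema/core">
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <!-- useCache controls the in-memory message cache per destination -->
        <policyEntry queue=">" useCache="true"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>
  <persistenceAdapter>
    <!-- concurrentStoreAndDispatchQueues enables the async store/dispatch race -->
    <kahaDB directory="${activemq.data}/kahadb"
            concurrentStoreAndDispatchQueues="true"/>
  </persistenceAdapter>
</broker>
```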

The two settings work together, and in practice the flow ends up being that 
the message is submitted to the store as part of an async task queued up in 
the background by the store. While that task is queued, the message is 
concurrently added to the in-memory cache and the broker proceeds to dispatch 
to consumers, who may or may not acknowledge dispatched messages before the 
disk write finishes (if they are fast and keeping up). Messages that were 
already written are removed as normal, but if the async task has not finished 
it is cancelled, saving a disk write.
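The flow above can be sketched with plain JDK concurrency primitives. This is 
not ActiveMQ code, just an illustration of the idea: the journal write is a 
queued async store task, the message is dispatched from the cache in the 
meantime, and a fast consumer ack cancels the queued write before it runs.

```java
import java.util.concurrent.*;

// Sketch of concurrentStoreAndDispatch (illustrative names, not ActiveMQ code).
public class StoreAndDispatchSketch {
    public static boolean demo() {
        ExecutorService storeTasks = Executors.newSingleThreadExecutor();
        CountDownLatch storeBusy = new CountDownLatch(1);
        // The store is busy with earlier work, so our write sits queued.
        storeTasks.submit(() -> {
            try { storeBusy.await(); } catch (InterruptedException ignored) { }
        });
        Future<?> diskWrite = storeTasks.submit(() -> { /* write to journal */ });
        // Meanwhile the message was cached and dispatched, and the consumer
        // acks before the queued write ever runs, so the write is cancelled:
        boolean writeSaved = diskWrite.cancel(false); // true: disk IO avoided
        storeBusy.countDown();
        storeTasks.shutdown();
        return writeSaved;
    }
}
```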


Bug description:

When the broker runs out of memory to cache messages, the cache has to be 
disabled. As part of this process the cache has to tell the store the last 
message that was cached, so that when the cache is exhausted the broker can 
resume paging off disk and dispatching at the correct spot.

The process for disabling the cache starts when a new incoming message is 
about to be added to the cache and the cache detects that memory is full. When 
this happens, the disable-and-sync process begins: the cache makes sure any 
previously cached messages with pending writes are completed (either acked and 
cancelled, or written to disk and completed), and then tells the store where 
to resume, which is after the last cached message. Once the cache is disabled, 
new writes should no longer be async, because messages must be written to disk 
before they can be dispatched.
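The disable-and-sync step might be sketched like this (again illustrative 
names only, not the actual KahaDB cursor code): settle every pending write for 
a previously cached message, then record the resume point after the last 
cached message id.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of disabling the cache when memory fills up (not ActiveMQ code).
public class CacheDisableSketch {
    public static long demo() {
        Deque<Long> cache = new ArrayDeque<>(List.of(1L, 2L, 3L));
        Map<Long, Future<?>> pendingWrites = new HashMap<>();
        pendingWrites.put(3L, CompletableFuture.completedFuture(null));
        // 1) Every cached message must be settled: either its async write
        //    finished, or it was acked and the write was cancelled.
        for (Future<?> write : pendingWrites.values()) {
            if (!write.isCancelled()) {
                try { write.get(); } catch (Exception e) { throw new RuntimeException(e); }
            }
        }
        pendingWrites.clear();
        // 2) The store resumes paging after the last cached message.
        long resumeAfter = cache.peekLast();
        cache.clear();
        return resumeAfter; // the next batch starts after id 3
    }
}
```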

In theory, because the store was told the last cached message, the new 
incoming message that triggered the disable/sync would eventually be paged off 
disk and dispatched. However, there is a race condition bug: sometimes the new 
incoming message has not completed its write to the store when the queue goes 
to fill the next batch to dispatch, so it gets missed because it is still 
pending. In this case the message that triggered the disable/sync was 
submitted to the cache but never actually cached (memory was full), and the 
dispatch continues and proceeds without guaranteeing the write.

The end result is that if the consumer is very fast, when the store goes to 
page off the next messages it may not see the pending write that has not 
finished (and was not added to the cache), so the store skips ahead before the 
incoming message completes. By the time the incoming message finishes writing 
to the store, the disk cursor has moved past it; the message is skipped and 
stays stuck until a broker restart resets the batch.

This can happen on every cache disable/sync cycle, which is why you might see 
one stuck message, or several if it happens repeatedly.
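The race can be modeled with an ordered set standing in for the store (this is 
only a model of the interleaving; the ids and names are made up). The message 
that triggered the disable (id 4 here) is not yet visible when the next batch 
is read, so the batch skips to id 5 and the cursor moves past the gap.

```java
import java.util.*;

// Sketch of the stuck-message race (illustrative, not ActiveMQ code).
public class StuckMessageSketch {
    public static List<Long> demo() {
        // Ids visible in the store; id 4's async write is still pending.
        NavigableSet<Long> store = new TreeSet<>(List.of(1L, 2L, 3L, 5L));
        long cursor = 3L; // resume point: the last cached message
        // Fill the next batch: id 4 is invisible, so the batch is [5].
        List<Long> batch = new ArrayList<>(store.tailSet(cursor, false));
        cursor = 5L;      // the cursor advances past the gap
        store.add(4L);    // the pending write completes too late:
                          // 4 < cursor, so id 4 is never paged in again
        return batch;     // id 4 is stuck until a restart resets the batch
    }
}
```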

Solution:

The solution is actually rather simple, only a couple of lines of code. 
Because the incoming message that triggered the disable/sync was never 
actually added to the cache (memory was full), we just need to wait for its 
store write task to finish, so that the message is visible when the next batch 
is read and paged off disk. This guarantees the message won't be missed, and 
no more stuck messages.
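In sketch form, the fix amounts to blocking on the triggering message's async 
write before the cursor reads the next batch (illustrative names again, not 
the actual patch):

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the fix (illustrative, not the actual ActiveMQ patch).
public class StuckFixSketch {
    public static boolean demo() {
        ExecutorService storeTasks = Executors.newSingleThreadExecutor();
        AtomicBoolean written = new AtomicBoolean(false);
        Future<?> triggeringWrite = storeTasks.submit(() -> written.set(true));
        try {
            // The fix: wait for the write of the message that triggered the
            // disable/sync to complete before filling the next batch...
            triggeringWrite.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            storeTasks.shutdown();
        }
        // ...so the message is visible in the store and cannot be skipped.
        return written.get();
    }
}
```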



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact

