[
https://issues.apache.org/jira/browse/AMQ-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christopher L. Shannon updated AMQ-9625:
----------------------------------------
Description:
For the last several years I have occasionally seen "stuck" messages appear on
queues that will not be dispatched until after a broker restart. The bug looks
to be the same as the one described in
https://issues.apache.org/jira/browse/AMQ-2955 (the root cause was never
found; the issue was closed because it couldn't be reproduced). The resulting
behavior is that KahaDB has its batch cursor set to a point after a message
that is stored, so that message is never dispatched. Some earlier work helped
with this issue, notably AMQ-3149, which changed the broker to only use
concurrentStoreAndDispatch when the cache is enabled (more details in that
issue and below).
I recently figured out how to reproduce the problem in a test environment and
finally tracked down the root cause and a fix. Not surprisingly, there are a
few things at play: the bug is a race condition, so it only appears when
several conditions hold and the broker is configured a certain way.
h3. Background:
Two broker optimizations play into this, and both must be enabled for the
issue to happen.
# {{useCache=true}}: The normal flow for incoming messages is that they are
written to the store and then paged off disk (on the same or another thread)
to be dispatched to consumers. However, there is also a message cache: if it
is enabled and there is free memory, a message is added to the cache after
being sent to disk so it does not have to be re-read off disk later when
dispatching.
# {{concurrentStoreAndDispatchQueues=true}}: The broker also has an
optimization for queues where it tries to dispatch incoming messages to
consumers concurrently with writing them to disk (async message writes). If
the consumers ack fast enough, the disk write can be cancelled, which saves
disk IO and is an obvious win on slow disks. This requires the cache to be
enabled, as described in AMQ-3149; otherwise messages whose writes had not
finished would never be dispatched, so this mode is really only useful with
the cache on.
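For reference, both settings are configured in {{activemq.xml}} — a minimal sketch (the broker name and data directory are placeholders):

```xml
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="localhost">
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <!-- useCache=true keeps just-stored messages in memory for dispatch -->
        <policyEntry queue=">" useCache="true"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>
  <persistenceAdapter>
    <!-- concurrentStoreAndDispatchQueues=true makes queue store writes async -->
    <kahaDB directory="${activemq.data}/kahadb"
            concurrentStoreAndDispatchQueues="true"/>
  </persistenceAdapter>
</broker>
```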
The two settings work together: in practice, the message is submitted to the
store as part of an async task that the store queues in the background. While
that task is queued, the message is concurrently added to the in-memory cache
and the broker proceeds to dispatch it to consumers, who may acknowledge it
before the disk write finishes if they are fast and keeping up. Messages that
were already written are removed as normal, but if the async task has not yet
run it is cancelled, saving a disk write.
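The cancellation path can be sketched as follows — a minimal model of the idea, not the actual KahaDB code (the class and method names here are invented for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model of concurrentStoreAndDispatch: the disk write is queued as a
// cancellable async task while the message is cached and dispatched
// concurrently; if the consumer acks before the write runs, the task is
// cancelled and the disk IO is saved.
public class StoreAndDispatchSketch {
    // Returns true if the fast ack managed to cancel the queued disk write.
    public static boolean fastAckCancelsWrite() throws Exception {
        ExecutorService storeExecutor = Executors.newSingleThreadExecutor();
        try {
            // The store's single write thread is busy with an earlier write
            // (modeling a slow disk), so our write sits in the queue:
            storeExecutor.submit(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            });
            Future<?> pendingWrite = storeExecutor.submit(
                () -> System.out.println("wrote message to disk"));

            // Meanwhile the message was cached and dispatched, and a fast
            // consumer acked it before the write ever ran: cancel the write.
            return pendingWrite.cancel(false);
        } finally {
            storeExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("disk write cancelled: " + fastAckCancelsWrite());
    }
}
```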
h3. Bug description:
When the broker runs out of memory to cache messages, the cache has to be
[disabled|https://github.com/apache/activemq/blob/3400983a22284a28a8989d4b0aaf762090b0911a/activemq-broker/src/main/java/org/apache/activemq/broker/region/cursors/AbstractStoreCursor.java#L258].
As part of this process the cache has to
[tell|https://github.com/apache/activemq/blob/3400983a22284a28a8989d4b0aaf762090b0911a/activemq-broker/src/main/java/org/apache/activemq/broker/region/cursors/AbstractStoreCursor.java#L336]
the store what the last cached message was, so that when the cache is
exhausted the broker can resume paging off disk and dispatching from the
correct spot. Disabling starts when a new incoming message is offered to the
cache and the cache detects that memory is full. At that point the
disable-and-sync process begins: the cache makes sure any previously cached
messages with pending writes are completed (either acked and cancelled, or
written to disk), and then tells the store where to resume, which is just
after the last cached message. Once the cache is disabled, new writes should
no longer be async, because messages must be written to disk before they can
be dispatched; AMQ-3149 added the initial check ensuring writes are not async
when the cache is disabled.
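The disable-and-sync sequence, reduced to its essentials — a toy model, not the real cursor code ({{setBatch}} mirrors the actual {{MessageStore}} method name, everything else here is invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

// Toy model of the cache-disable sequence in AbstractStoreCursor.
public class CacheDisableSketch {
    final List<Future<?>> pendingCachedWrites = new ArrayList<>();
    String lastCachedId;          // id of the last message that was cached
    boolean cacheEnabled = true;
    String storeBatch;            // stands in for store.setBatch(lastCachedId)

    void disableCacheAndSync() throws Exception {
        // 1. Make sure every async write for an already-cached message is
        //    finished: cancel it if the message was acked, else wait for it.
        for (Future<?> write : pendingCachedWrites) {
            if (!write.cancel(false)) write.get();
        }
        cacheEnabled = false;
        // 2. Tell the store where to resume paging off disk: just after the
        //    last message that made it into the cache.
        storeBatch = lastCachedId;
    }

    public static void main(String[] args) throws Exception {
        CacheDisableSketch cursor = new CacheDisableSketch();
        cursor.lastCachedId = "msg-41";
        cursor.disableCacheAndSync();
        System.out.println("store resumes after: " + cursor.storeBatch);
    }
}
```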
In theory, because the store was told the last cached message, the new
incoming message that triggered the disable/sync would eventually be paged off
disk and dispatched. In practice there is a race condition: sometimes that
message's store write has not completed by the time the queue fills its next
batch to dispatch, so the message is missed because it is still pending. The
message that triggered the disable/sync was offered to the cache but never
actually cached (memory was full), and dispatch continues without
guaranteeing the write.
The end result is that if the consumer is very fast, when the store pages off
the next messages it may not see the unfinished pending write (which was also
never added to the cache), so the store skips ahead before the incoming
message completes. By the time that message finishes writing to the store,
the disk cursor has moved past it; the message is skipped and stays stuck
until a broker restart resets the batch. This can happen on every cache
disable/enable cycle, which is why you might see one stuck message or
several.
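The race can be shown as a toy timeline, with sequence numbers standing in for journal positions (this is an illustration of the interleaving, not the KahaDB implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy timeline of the race: the cursor fills its next batch while msg 42's
// write is still pending, moves its batch pointer past 42, and then 42
// lands on disk behind the pointer and is never paged in.
public class StuckMessageSketch {
    public static List<Integer> run() {
        TreeSet<Integer> journal = new TreeSet<>(); // durable messages on disk
        int batchPointer = 41;  // store told to resume after last cached msg 41

        // msg 42 triggered the cache disable; its async write is still pending.
        // msg 43 arrives after the cache is off, so it is written synchronously:
        journal.add(43);

        // A fast consumer drains, and the cursor fills the next batch now:
        List<Integer> dispatched = new ArrayList<>(journal.tailSet(batchPointer + 1));
        if (!dispatched.isEmpty()) {
            batchPointer = dispatched.get(dispatched.size() - 1);
        }
        // batchPointer is now 43 -- past the spot where 42 will land.

        // Only now does msg 42's async write complete:
        journal.add(42);

        // Every later batch starts after 43, so msg 42 is never paged in:
        dispatched.addAll(journal.tailSet(batchPointer + 1));
        return dispatched;  // contains 43 but never 42 -> 42 is "stuck"
    }

    public static void main(String[] args) {
        System.out.println("dispatched: " + run());
    }
}
```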
h3. Solution:
The solution is actually rather simple: a couple of lines of code. Because the
incoming message that was offered to the cache was never added (memory was
full), we just need to wait for that triggering message to finish its store
write before syncing, so that it is visible when the next batch is read and
paged off disk. This guarantees the message is not missed and there are no
more stuck messages.
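A sketch of the shape of the fix (illustrative only; the actual patch lives in the cursor/store sync path, and the names here are invented):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the fix: before telling the store where to resume and paging in
// the next batch, block until the message that triggered the cache disable
// has finished its async store write.
public class WaitForPendingWriteSketch {
    public static boolean disableCacheSafely(Future<?> triggeringWrite) throws Exception {
        // ... complete or cancel writes for previously cached messages ...

        // NEW: the triggering message was never cached (memory was full), so
        // a consumer ack cannot cancel its write -- wait until it is durable.
        triggeringWrite.get();
        return triggeringWrite.isDone(); // now safe to set the batch and page in
    }

    public static void main(String[] args) throws Exception {
        ExecutorService store = Executors.newSingleThreadExecutor();
        try {
            Future<?> write = store.submit(() -> { /* the pending disk write */ });
            System.out.println("durable before resuming: " + disableCacheSafely(write));
        } finally {
            store.shutdown();
        }
    }
}
```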
> Messages can become stuck on Queues
> -----------------------------------
>
> Key: AMQ-9625
> URL: https://issues.apache.org/jira/browse/AMQ-9625
> Project: ActiveMQ Classic
> Issue Type: Bug
> Affects Versions: 5.18.6, 6.1.4
> Reporter: Christopher L. Shannon
> Assignee: Christopher L. Shannon
> Priority: Major
> Fix For: 6.2.0, 5.18.7, 6.1.5
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact