[
https://issues.apache.org/jira/browse/ARTEMIS-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482629#comment-17482629
]
Anders Wallgren commented on ARTEMIS-3583:
------------------------------------------
[~riconeubauer] We see that exception as well. It happens on a system that
isn't very busy, so it seems spurious to me, too.
> Race condition in QueueImpl#expireReferences(java.lang.Runnable) can cause
> spurious AMQ224013 warnings to be emitted.
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-3583
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3583
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.18.0
> Environment: Large-core machines running embedded ActiveMQ broker.
> Reporter: Anders Wallgren
> Priority: Major
>
> A recent change - [https://github.com/apache/activemq-artemis/pull/3654] -
> appears to introduce a race condition in QueueImpl#expireReferences(Runnable)
> that can cause the Runnable "done" parameter to never be invoked. This
> manifests itself as spurious "AMQ224013: failed to expire messages for queue"
> messages.
> The problem is this section of code and the non-atomic test-and-set on the
> expiryScanner.scannerRunning AtomicInteger:
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) { <--- "test"
> if (expiryScanner.scannerRunning.incrementAndGet() == 1) { <--- "and-set"
> expiryScanner.doneCallback = done;
> }
> getExecutor().execute(expiryScanner);
> } else {
> // expire is already happening on this queue, move on!
> if (done != null) {
> done.run();
> }
> {code}
> Consider the following sequence of events and see if you agree with my
> analysis:
> # Two threads, t1 and t2
> # t1 runs PostOfficeImpl.ExpiryReaper which calls
> QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration
> is "done". It waits 10 seconds for the latch to drop. If the latch doesn't
> drop within the timeout it issues a AMQ224013 warning.
> # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which
> in turn calls QueueImpl.expireReferences(null)
> # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they
> both enter that branch. <--- This is where things start to go wrong, as only
> one of the two threads will successfully set expiryScanner.done; if the
> thread that "loses" is the one that supplies a non-null "done" parameter then
> that Runnable will never be invoked.
> # t2 successfully increments expiryScanner.scannerRunning to 1 and sets
> expiryScanner.doneCallback = null (since it passed null as the "done"
> parameter) and then continues to run expiryScanner.
> # t1 doesn't increment expiryScanner.scannerRunning and at this point the
> "done" argument is lost and will never be invoked, guaranteeing the AMQ224013
> warning. t1 then also runs the expiryScanner (which has already been
> submitted once by t2)
> Because of the non-atomic test-and-set, the callback is not guaranteed to be
> invoked and, also, more than one expiry run will execute simultaneously (this
> seems to be undesirable based on why this change was made in the first place).
> Should the code not be something like this (I'm not 100% familiar with the
> semantics of scannerRunning being > 1, so this may not be the correct
> solution):
>
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) {
> <--- test-and-set
> expiryScanner.doneCallback = done;
> getExecutor().execute(expiryScanner);
> } else {
> // expire is already happening on this queue, move on!
> if (done != null) {
> done.run();
> }
> {code}
>
> This guarantees that only one thread can enter the section that sets
> expiryScanner.doneCallback and submits expiryScanner for execution; all other
> threads will immediately have their Runnable invoked.
> We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while
> testing on very large instances (64 CPUs and up) that are very busy.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)