Anders Wallgren created ARTEMIS-3583:
----------------------------------------
Summary: Race condition in
QueueImpl#expireReferences(java.lang.Runnable) can cause spurious AMQ224013
warnings to be emitted.
Key: ARTEMIS-3583
URL: https://issues.apache.org/jira/browse/ARTEMIS-3583
Project: ActiveMQ Artemis
Issue Type: Bug
Components: Broker
Affects Versions: 2.18.0
Environment: Large-core machines running embedded ActiveMQ broker.
Reporter: Anders Wallgren
A recent change - [https://github.com/apache/activemq-artemis/pull/3654] -
appears to introduce a race condition in QueueImpl#expireReferences(Runnable)
that can cause the Runnable to never be invoked..
This manifests itself as spurious "AMQ224013: failed to expire messages for
queue" messages.
The problem is this section of code and the non-atomic test-and-set on the
expiryScanner.scannerRunning AtomicInteger:
{{if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {}}
{{ if (expiryScanner.scannerRunning.incrementAndGet() == 1) {}}
{{ expiryScanner.doneCallback = done;}}
{{ }}}
{{ getExecutor().execute(expiryScanner);}}
{{} else {}}
{{ // expire is already happening on this queue, move on!}}
{{ if (done != null) {}}
{{ done.run();}}
{{ }}}
{{}}}
Consider the following sequence of events and see if you agree with my analysis:
# Two threads, t1 and t2
# t1 runs PostOfficeImpl.ExpiryReaper which calls
QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration is
"done". It waits 10 seconds for the latch to drop. If the latch doesn't drop
within the timeout it issues a AMQ224013 warning.
# t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which in
turn calls QueueImpl.expireReferences(null)
# t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they both
enter that branch.
# t2 successfully increments expiryScanner.scannerRunning to 1 and sets
expiryScanner.doneCallback = null (since it passed null as the "done"
parameter) and then continues to run expiryScanner.
# t1 doesn't increment expiryScanner.scannerRunning and at this point the
"done" argument is lost and will never be invoked, guaranteeing the AMQ224013
warning. t1 then also runs the expiryScanner (which has already been submitted
once by t2)
Because of the non-atomic test-and-set, the callback is not guaranteed to be
invoked and, also, more than one expiry run will execute simultaneously (this
seems to be undesirable based on why this change was made in the first place).
Should the code not be something like this:
{{if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) {}}
{{ expiryScanner.doneCallback = done;}}
{{ getExecutor().execute(expiryScanner);}}
{{} else {}}
{{ // expire is already happening on this queue, move on!}}
{{ if (done != null) {}}
{{ done.run();}}
{{ }}}
{{}}}
This guarantees that only one thread can enter the section that sets
expiryScanner.doneCallback and submits expiryScanner for execution; all other
threads will immediately have their Runnable invoked.
We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while
testing on very large instances (64 CPUs and up) that are very busy.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)