[
https://issues.apache.org/jira/browse/ARTEMIS-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477232#comment-17477232
]
Rico Neubauer commented on ARTEMIS-3583:
----------------------------------------
Sporadically getting error below, which to me seems related to this issue:
{code:java}
ERROR org.apache.activemq.artemis.core.server
(ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@69242555)
[org.apache.activemq.artemis-server-osgi:2.20.0]
AMQ224013: failed to expire messages for queue
java.util.concurrent.TimeoutException: queueName
at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl$ExpiryReaper.run(PostOfficeImpl.java:1861)
~[?:?]
at
org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent.runForExecutor(ActiveMQScheduledComponent.java:313)
~[?:?]
at
org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent.lambda$bookedRunForScheduler$2(ActiveMQScheduledComponent.java:320)
~[?:?]
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
~[?:?]
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
~[?:?]
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
~[?:?]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
~[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
~[?:?]
at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
~[?:?]{code}
´´
> Race condition in QueueImpl#expireReferences(java.lang.Runnable) can cause
> spurious AMQ224013 warnings to be emitted.
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-3583
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3583
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.18.0
> Environment: Large-core machines running embedded ActiveMQ broker.
> Reporter: Anders Wallgren
> Priority: Major
>
> A recent change - [https://github.com/apache/activemq-artemis/pull/3654] -
> appears to introduce a race condition in QueueImpl#expireReferences(Runnable)
> that can cause the Runnable "done" parameter to never be invoked. This
> manifests itself as spurious "AMQ224013: failed to expire messages for queue"
> messages.
> The problem is this section of code and the non-atomic test-and-set on the
> expiryScanner.scannerRunning AtomicInteger:
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) { <--- "test"
> if (expiryScanner.scannerRunning.incrementAndGet() == 1) { <--- "and-set"
> expiryScanner.doneCallback = done;
> }
> getExecutor().execute(expiryScanner);
> } else {
> // expire is already happening on this queue, move on!
> if (done != null) {
> done.run();
> }
> {code}
> Consider the following sequence of events and see if you agree with my
> analysis:
> # Two threads, t1 and t2
> # t1 runs PostOfficeImpl.ExpiryReaper which calls
> QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration
> is "done". It waits 10 seconds for the latch to drop. If the latch doesn't
> drop within the timeout it issues a AMQ224013 warning.
> # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which
> in turn calls QueueImpl.expireReferences(null)
> # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they
> both enter that branch. <--- This is where things start to go wrong, as only
> one of the two threads will successfully set expiryScanner.done; if the
> thread that "loses" is the one that supplies a non-null "done" parameter then
> that Runnable will never be invoked.
> # t2 successfully increments expiryScanner.scannerRunning to 1 and sets
> expiryScanner.doneCallback = null (since it passed null as the "done"
> parameter) and then continues to run expiryScanner.
> # t1 doesn't increment expiryScanner.scannerRunning and at this point the
> "done" argument is lost and will never be invoked, guaranteeing the AMQ224013
> warning. t1 then also runs the expiryScanner (which has already been
> submitted once by t2)
> Because of the non-atomic test-and-set, the callback is not guaranteed to be
> invoked and, also, more than one expiry run will execute simultaneously (this
> seems to be undesirable based on why this change was made in the first place).
> Should the code not be something like this (I'm not 100% familiar with the
> semantics of scannerRunning being > 1, so this may not be the correct
> solution):
>
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) {
> <--- test-and-set
> expiryScanner.doneCallback = done;
> getExecutor().execute(expiryScanner);
> } else {
> // expire is already happening on this queue, move on!
> if (done != null) {
> done.run();
> }
> {code}
>
> This guarantees that only one thread can enter the section that sets
> expiryScanner.doneCallback and submits expiryScanner for execution; all other
> threads will immediately have their Runnable invoked.
> We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while
> testing on very large instances (64 CPUs and up) that are very busy.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)