Anders Wallgren created ARTEMIS-3583:
----------------------------------------

             Summary: Race condition in 
QueueImpl#expireReferences(java.lang.Runnable) can cause spurious AMQ224013 
warnings to be emitted.
                 Key: ARTEMIS-3583
                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3583
             Project: ActiveMQ Artemis
          Issue Type: Bug
          Components: Broker
    Affects Versions: 2.18.0
         Environment: Large-core machines running embedded ActiveMQ broker.
            Reporter: Anders Wallgren


A recent change - [https://github.com/apache/activemq-artemis/pull/3654] - 
appears to introduce a race condition in QueueImpl#expireReferences(Runnable) 
that can cause the Runnable to never be invoked.. 

This manifests itself as spurious "AMQ224013: failed to expire messages for 
queue" messages.

The problem is this section of code and the non-atomic test-and-set on the 
expiryScanner.scannerRunning AtomicInteger:

{{if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {}}
{{  if (expiryScanner.scannerRunning.incrementAndGet() == 1) {}}
{{    expiryScanner.doneCallback = done;}}
{{  }}}
{{  getExecutor().execute(expiryScanner);}}
{{} else {}}
{{  // expire is already happening on this queue, move on!}}
{{  if (done != null) {}}
{{    done.run();}}
{{  }}}
{{}}}

Consider the following sequence of events and see if you agree with my analysis:
 # Two threads, t1 and t2
 # t1 runs PostOfficeImpl.ExpiryReaper which calls 
QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration is 
"done". It waits 10 seconds for the latch to drop. If the latch doesn't drop 
within the timeout it issues a AMQ224013 warning.
 # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which in 
turn calls QueueImpl.expireReferences(null)
 # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they both 
enter that branch.
 # t2 successfully increments expiryScanner.scannerRunning to 1 and sets 
expiryScanner.doneCallback = null (since it passed null as the "done" 
parameter) and then continues to run expiryScanner.
 # t1 doesn't increment expiryScanner.scannerRunning and at this point the 
"done" argument is lost and will never be invoked, guaranteeing the AMQ224013 
warning. t1 then also runs the expiryScanner (which has already been submitted 
once by t2)

Because of the non-atomic test-and-set, the callback is not guaranteed to be 
invoked and, also, more than one expiry run will execute simultaneously (this 
seems to be undesirable based on why this change was made in the first place).

Should the code not be something like this:

{{if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) {}}
{{  expiryScanner.doneCallback = done;}}
{{  getExecutor().execute(expiryScanner);}}
{{} else {}}
{{  // expire is already happening on this queue, move on!}}
{{  if (done != null) {}}
{{    done.run();}}
{{  }}}
{{}}}

This guarantees that only one thread can enter the section that sets 
expiryScanner.doneCallback and submits expiryScanner for execution; all other 
threads will immediately have their Runnable invoked. 

We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while 
testing on very large instances (64 CPUs and up) that are very busy.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to