[ 
https://issues.apache.org/jira/browse/ARTEMIS-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477232#comment-17477232
 ] 

Rico Neubauer commented on ARTEMIS-3583:
----------------------------------------

Sporadically getting error below, which to me seems related to this issue:
{code:java}
ERROR     org.apache.activemq.artemis.core.server     
(ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@69242555)
     [org.apache.activemq.artemis-server-osgi:2.20.0]                         
AMQ224013: failed to expire messages for queue
java.util.concurrent.TimeoutException: queueName
    at 
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl$ExpiryReaper.run(PostOfficeImpl.java:1861)
 ~[?:?]
    at 
org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent.runForExecutor(ActiveMQScheduledComponent.java:313)
 ~[?:?]
    at 
org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent.lambda$bookedRunForScheduler$2(ActiveMQScheduledComponent.java:320)
 ~[?:?]
    at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
 ~[?:?]
    at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
 ~[?:?]
    at 
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
 ~[?:?]
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) 
~[?:?]
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) 
~[?:?]
    at 
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
 ~[?:?]{code}
´´

> Race condition in QueueImpl#expireReferences(java.lang.Runnable) can cause 
> spurious AMQ224013 warnings to be emitted.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-3583
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3583
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.18.0
>         Environment: Large-core machines running embedded ActiveMQ broker.
>            Reporter: Anders Wallgren
>            Priority: Major
>
> A recent change - [https://github.com/apache/activemq-artemis/pull/3654] - 
> appears to introduce a race condition in QueueImpl#expireReferences(Runnable) 
> that can cause the Runnable "done" parameter to never be invoked. This 
> manifests itself as spurious "AMQ224013: failed to expire messages for queue" 
> messages.
> The problem is this section of code and the non-atomic test-and-set on the 
> expiryScanner.scannerRunning AtomicInteger:
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {  <--- "test"
>   if (expiryScanner.scannerRunning.incrementAndGet() == 1) { <--- "and-set"
>     expiryScanner.doneCallback = done;
>   }
>   getExecutor().execute(expiryScanner);
> } else {
>   // expire is already happening on this queue, move on!
>   if (done != null) {
>     done.run();
> }
> {code}
> Consider the following sequence of events and see if you agree with my 
> analysis:
>  # Two threads, t1 and t2
>  # t1 runs PostOfficeImpl.ExpiryReaper which calls 
> QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration 
> is "done". It waits 10 seconds for the latch to drop. If the latch doesn't 
> drop within the timeout it issues a AMQ224013 warning.
>  # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which 
> in turn calls QueueImpl.expireReferences(null)
>  # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they 
> both enter that branch. <--- This is where things start to go wrong, as only 
> one of the two threads will successfully set expiryScanner.done; if the 
> thread that "loses" is the one that supplies a non-null "done" parameter then 
> that Runnable will never be invoked.
>  # t2 successfully increments expiryScanner.scannerRunning to 1 and sets 
> expiryScanner.doneCallback = null (since it passed null as the "done" 
> parameter) and then continues to run expiryScanner.
>  # t1 doesn't increment expiryScanner.scannerRunning and at this point the 
> "done" argument is lost and will never be invoked, guaranteeing the AMQ224013 
> warning. t1 then also runs the expiryScanner (which has already been 
> submitted once by t2)
> Because of the non-atomic test-and-set, the callback is not guaranteed to be 
> invoked and, also, more than one expiry run will execute simultaneously (this 
> seems to be undesirable based on why this change was made in the first place).
> Should the code not be something like this (I'm not 100% familiar with the 
> semantics of scannerRunning being > 1, so this may not be the correct 
> solution):
>  
> {code:java}
> if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) { 
> <--- test-and-set
>   expiryScanner.doneCallback = done;
>   getExecutor().execute(expiryScanner);
> } else {
>   // expire is already happening on this queue, move on!
>   if (done != null) {
>     done.run();
> }
> {code}
>  
> This guarantees that only one thread can enter the section that sets 
> expiryScanner.doneCallback and submits expiryScanner for execution; all other 
> threads will immediately have their Runnable invoked. 
> We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while 
> testing on very large instances (64 CPUs and up) that are very busy.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to