[ 
https://issues.apache.org/jira/browse/ARTEMIS-3583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anders Wallgren updated ARTEMIS-3583:
-------------------------------------
    Description: 
A recent change - [https://github.com/apache/activemq-artemis/pull/3654] - 
appears to introduce a race condition in QueueImpl#expireReferences(Runnable) 
that can cause the Runnable to never be invoked. This manifests itself as 
spurious "AMQ224013: failed to expire messages for queue" messages.

The problem is this section of code and the non-atomic test-and-set on the 
expiryScanner.scannerRunning AtomicInteger:

{{if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {}}
{{  if (expiryScanner.scannerRunning.incrementAndGet() == 1) {}}
{{    expiryScanner.doneCallback = done;}}
{\{  }}}
{{  getExecutor().execute(expiryScanner);}}
{{} else {}}
{{  // expire is already happening on this queue, move on!}}
{{  if (done != null) {}}
{{    done.run();}}
{\{  }}}
{{}}}

Consider the following sequence of events and see if you agree with my analysis:
 # Two threads, t1 and t2
 # t1 runs PostOfficeImpl.ExpiryReaper which calls 
QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration is 
"done". It waits 10 seconds for the latch to drop. If the latch doesn't drop 
within the timeout it issues a AMQ224013 warning.
 # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which in 
turn calls QueueImpl.expireReferences(null)
 # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they both 
enter that branch.
 # t2 successfully increments expiryScanner.scannerRunning to 1 and sets 
expiryScanner.doneCallback = null (since it passed null as the "done" 
parameter) and then continues to run expiryScanner.
 # t1 doesn't increment expiryScanner.scannerRunning and at this point the 
"done" argument is lost and will never be invoked, guaranteeing the AMQ224013 
warning. t1 then also runs the expiryScanner (which has already been submitted 
once by t2)

Because of the non-atomic test-and-set, the callback is not guaranteed to be 
invoked and, also, more than one expiry run will execute simultaneously (this 
seems to be undesirable based on why this change was made in the first place).

Should the code not be something like this:

{{if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) {}}
{{  expiryScanner.doneCallback = done;}}
{{  getExecutor().execute(expiryScanner);}}
{{} else {}}
{{  // expire is already happening on this queue, move on!}}
{{  if (done != null) {}}
{{    done.run();}}
{\{  }}}
{{}}}

This guarantees that only one thread can enter the section that sets 
expiryScanner.doneCallback and submits expiryScanner for execution; all other 
threads will immediately have their Runnable invoked. 

We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while 
testing on very large instances (64 CPUs and up) that are very busy.

  was:
A recent change - [https://github.com/apache/activemq-artemis/pull/3654] - 
appears to introduce a race condition in QueueImpl#expireReferences(Runnable) 
that can cause the Runnable to never be invoked.. 

This manifests itself as spurious "AMQ224013: failed to expire messages for 
queue" messages.

The problem is this section of code and the non-atomic test-and-set on the 
expiryScanner.scannerRunning AtomicInteger:

{{if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {}}
{{  if (expiryScanner.scannerRunning.incrementAndGet() == 1) {}}
{{    expiryScanner.doneCallback = done;}}
{{  }}}
{{  getExecutor().execute(expiryScanner);}}
{{} else {}}
{{  // expire is already happening on this queue, move on!}}
{{  if (done != null) {}}
{{    done.run();}}
{{  }}}
{{}}}

Consider the following sequence of events and see if you agree with my analysis:
 # Two threads, t1 and t2
 # t1 runs PostOfficeImpl.ExpiryReaper which calls 
QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration is 
"done". It waits 10 seconds for the latch to drop. If the latch doesn't drop 
within the timeout it issues a AMQ224013 warning.
 # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which in 
turn calls QueueImpl.expireReferences(null)
 # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they both 
enter that branch.
 # t2 successfully increments expiryScanner.scannerRunning to 1 and sets 
expiryScanner.doneCallback = null (since it passed null as the "done" 
parameter) and then continues to run expiryScanner.
 # t1 doesn't increment expiryScanner.scannerRunning and at this point the 
"done" argument is lost and will never be invoked, guaranteeing the AMQ224013 
warning. t1 then also runs the expiryScanner (which has already been submitted 
once by t2)

Because of the non-atomic test-and-set, the callback is not guaranteed to be 
invoked and, also, more than one expiry run will execute simultaneously (this 
seems to be undesirable based on why this change was made in the first place).

Should the code not be something like this:

{{if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) {}}
{{  expiryScanner.doneCallback = done;}}
{{  getExecutor().execute(expiryScanner);}}
{{} else {}}
{{  // expire is already happening on this queue, move on!}}
{{  if (done != null) {}}
{{    done.run();}}
{{  }}}
{{}}}

This guarantees that only one thread can enter the section that sets 
expiryScanner.doneCallback and submits expiryScanner for execution; all other 
threads will immediately have their Runnable invoked. 

We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while 
testing on very large instances (64 CPUs and up) that are very busy.


> Race condition in QueueImpl#expireReferences(java.lang.Runnable) can cause 
> spurious AMQ224013 warnings to be emitted.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-3583
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3583
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.18.0
>         Environment: Large-core machines running embedded ActiveMQ broker.
>            Reporter: Anders Wallgren
>            Priority: Major
>
> A recent change - [https://github.com/apache/activemq-artemis/pull/3654] - 
> appears to introduce a race condition in QueueImpl#expireReferences(Runnable) 
> that can cause the Runnable to never be invoked. This manifests itself as 
> spurious "AMQ224013: failed to expire messages for queue" messages.
> The problem is this section of code and the non-atomic test-and-set on the 
> expiryScanner.scannerRunning AtomicInteger:
> {{if (!queueDestroyed && expiryScanner.scannerRunning.get() == 0) {}}
> {{  if (expiryScanner.scannerRunning.incrementAndGet() == 1) {}}
> {{    expiryScanner.doneCallback = done;}}
> {\{  }}}
> {{  getExecutor().execute(expiryScanner);}}
> {{} else {}}
> {{  // expire is already happening on this queue, move on!}}
> {{  if (done != null) {}}
> {{    done.run();}}
> {\{  }}}
> {{}}}
> Consider the following sequence of events and see if you agree with my 
> analysis:
>  # Two threads, t1 and t2
>  # t1 runs PostOfficeImpl.ExpiryReaper which calls 
> QueueImpl.expireReferences(latch::countDown) to drop a latch when expiration 
> is "done". It waits 10 seconds for the latch to drop. If the latch doesn't 
> drop within the timeout it issues a AMQ224013 warning.
>  # t2 calls QueueImpl.depage which calls QueueImpl.expireReferences(), which 
> in turn calls QueueImpl.expireReferences(null)
>  # t1 and t2 both see that expiryScanner.scannerRunning.get() == 0 so they 
> both enter that branch.
>  # t2 successfully increments expiryScanner.scannerRunning to 1 and sets 
> expiryScanner.doneCallback = null (since it passed null as the "done" 
> parameter) and then continues to run expiryScanner.
>  # t1 doesn't increment expiryScanner.scannerRunning and at this point the 
> "done" argument is lost and will never be invoked, guaranteeing the AMQ224013 
> warning. t1 then also runs the expiryScanner (which has already been 
> submitted once by t2)
> Because of the non-atomic test-and-set, the callback is not guaranteed to be 
> invoked and, also, more than one expiry run will execute simultaneously (this 
> seems to be undesirable based on why this change was made in the first place).
> Should the code not be something like this:
> {{if (!queueDestroyed && expiryScanner.scannerRunning.compareAndSet(0, 1)) {}}
> {{  expiryScanner.doneCallback = done;}}
> {{  getExecutor().execute(expiryScanner);}}
> {{} else {}}
> {{  // expire is already happening on this queue, move on!}}
> {{  if (done != null) {}}
> {{    done.run();}}
> {\{  }}}
> {{}}}
> This guarantees that only one thread can enter the section that sets 
> expiryScanner.doneCallback and submits expiryScanner for execution; all other 
> threads will immediately have their Runnable invoked. 
> We have seen these AMQ224013 warnings while trying to upgrade to 2.18.0 while 
> testing on very large instances (64 CPUs and up) that are very busy.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to