[ 
https://issues.apache.org/jira/browse/ARTEMIS-4527?focusedWorklogId=894820&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-894820
 ]

ASF GitHub Bot logged work on ARTEMIS-4527:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Dec/23 20:18
            Start Date: 08/Dec/23 20:18
    Worklog Time Spent: 10m 
      Work Description: AntonRoskvist commented on PR #4705:
URL: 
https://github.com/apache/activemq-artemis/pull/4705#issuecomment-1847792682

   No, I have only been able to get an idea of what happens after the fact... 
the window of opportunity for this to happen is really slim. In fact, early 
in my troubleshooting I tried to add logging in PostOffice and the 
ClusterConnection for the notifications, but doing so seemingly added enough 
of a delay to not trigger the issue (at least in the setup I used to reproduce; 
it's possible it would happen given different run values for the reproducer).
   
   From what I can gather at least, locally everything happens in the correct 
order. Local counters have always been correct.
   
   My **guess** would be that in some circumstances the server's `createQueue()` 
can take some time to finish, such that a binding gets added, but before the 
BINDING_ADDED notification is sent, a call to the ServerConsumer's 
`createConsumer()` is issued... this call requires no synchronization on 
PostOffice (as far as I can tell), so it's able to finish (and send its 
notification) before the server's `createQueue()` finishes all the way and 
sends its own notification.
   
   So... my assumption is that something along those lines is causing this, 
which is why I added synchronization on PostOffice before allowing 
`createConsumer()` to send its notification (since it's `addBinding()` in 
PostOffice that sends the BINDING_ADDED notification). After making that change 
I have been unable to reproduce the issue again.
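   
   A minimal, self-contained sketch of the ordering idea described above. It is 
not Artemis code: the class `NotificationOrderSketch`, the `postOfficeLock` 
monitor, the latch and the notification strings are all hypothetical stand-ins, 
and the 50 ms sleep only simulates a slow `createQueue()`. It assumes the 
consumer path takes the same lock that the queue path holds from the moment the 
binding becomes visible until BINDING_ADDED has been sent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class NotificationOrderSketch {

    // Hypothetical stand-in for the PostOffice monitor (not the Artemis class).
    private final Object postOfficeLock = new Object();
    // Notifications in the order they were "sent"; guarded by postOfficeLock.
    private final List<String> sent = new ArrayList<>();
    // Released once the binding is visible to other threads.
    private final CountDownLatch bindingVisible = new CountDownLatch(1);

    // Simulated slow createQueue(): the binding is visible before its notification goes out.
    void createQueue() {
        synchronized (postOfficeLock) {
            bindingVisible.countDown();          // binding added
            try {
                Thread.sleep(50);                // simulated delay before notifying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            sent.add("BINDING_ADDED");
        }
    }

    // Simulated createConsumer(): starts as soon as the binding is visible.
    void createConsumer() {
        try {
            bindingVisible.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        // Taking the same lock forces this notification behind BINDING_ADDED.
        synchronized (postOfficeLock) {
            sent.add("CONSUMER_CREATED");
        }
    }

    static List<String> run() {
        NotificationOrderSketch sketch = new NotificationOrderSketch();
        Thread consumer = new Thread(sketch::createConsumer);
        Thread queue = new Thread(sketch::createQueue);
        consumer.start();
        queue.start();
        try {
            queue.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(sent(sketch));
    }

    private static List<String> sent(NotificationOrderSketch sketch) {
        return sketch.sent;                      // safe to read after both joins
    }

    public static void main(String[] args) {
        System.out.println(run());               // prints [BINDING_ADDED, CONSUMER_CREATED]
    }
}
```

   In this sketch, dropping the `synchronized` block from `createConsumer()` 
would let CONSUMER_CREATED be recorded during the simulated delay, which is the 
out-of-order case suspected above.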
   
   If it were to happen again though, the changes made in 
`RemoteQueueBindingImpl` should stop the redistributor from causing any issues 
regardless, but I'd much rather understand everything that's going on here for 
sure, if nothing else to be able to write a better reproducer for this...




Issue Time Tracking
-------------------

    Worklog Id:     (was: 894820)
    Time Spent: 50m  (was: 40m)

> Redistributor race when consumerCount reaches 0 in cluster
> ----------------------------------------------------------
>
>                 Key: ARTEMIS-4527
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4527
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>            Reporter: Anton Roskvist
>            Priority: Major
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> This is a very rare bug caused by cluster notifications arriving in the wrong 
> order in some very specific circumstances



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
