[
https://issues.apache.org/jira/browse/ARTEMIS-4527?focusedWorklogId=894820&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-894820
]
ASF GitHub Bot logged work on ARTEMIS-4527:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 08/Dec/23 20:18
Start Date: 08/Dec/23 20:18
Worklog Time Spent: 10m
Work Description: AntonRoskvist commented on PR #4705:
URL:
https://github.com/apache/activemq-artemis/pull/4705#issuecomment-1847792682
No, I have only been able to get an idea of what happens after the fact...
the window of opportunity for this to happen is really slim... In fact, early
in my troubleshooting I tried to add logging in Postoffice and the
Clusterconnection for the notifications, but doing so seemingly added enough of
a delay to not trigger the issue (at least in the setup I used to reproduce;
it's possible it would happen given different run values for the reproducer).
From what I can gather at least, locally everything happens in the correct
order. Local counters have always been correct.
My **guess** would be that in some circumstances the server's `createQueue()`
can take some time to finish, such that a binding gets added, but
before the BINDING_ADDED notification is sent, a call to the ServerConsumer's
`createConsumer()` is issued... this call requires no synchronization on
Postoffice (as far as I can tell) and so it's able to finish (and send its
notification) before the server's `createQueue()` finishes all the way and sends
its own notification.
So... my assumption is that something along those lines is causing this,
which is why I added synchronization on Postoffice before allowing
`createConsumer()` to send its notification (since it's `addBinding()` in
Postoffice that sends the BINDING_ADDED notification). After making that change
I have been unable to reproduce the issue again.
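As a rough illustration of the suspected ordering problem and of why taking the
same lock helps, here is a toy model. It is not the change made in the PR and
uses no Artemis classes: `ToyPostOffice`, `consumerCreated()`, `run()`, the
100 ms sleep and the string notifications are all hypothetical stand-ins, and
the sketch assumes the binding becomes visible while the post office monitor is
still held.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Toy model only: none of these types are Artemis classes.
class NotificationOrderSketch {

    /** Hypothetical stand-in for the post office; records notifications in send order. */
    static final class ToyPostOffice {
        private final List<String> sent = Collections.synchronizedList(new ArrayList<>());
        private final CountDownLatch bindingVisible = new CountDownLatch(1);

        // Makes the binding visible, then sends BINDING_ADDED later, all under the monitor.
        synchronized void addBinding() throws InterruptedException {
            bindingVisible.countDown(); // other threads can now see the binding
            Thread.sleep(100);          // simulated slow tail of queue creation (assumption)
            sent.add("BINDING_ADDED");
        }

        // Sends CONSUMER_CREATED, optionally waiting for the post office monitor first.
        void consumerCreated(boolean synchronizeOnPostOffice) {
            if (synchronizeOnPostOffice) {
                synchronized (this) {
                    sent.add("CONSUMER_CREATED");
                }
            } else {
                sent.add("CONSUMER_CREATED");
            }
        }
    }

    // Runs one queue-creating thread and one consumer-creating thread, returns the send order.
    static List<String> run(boolean synchronizeOnPostOffice) throws InterruptedException {
        ToyPostOffice postOffice = new ToyPostOffice();
        Thread queueCreator = new Thread(() -> {
            try {
                postOffice.addBinding();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumerCreator = new Thread(() -> {
            try {
                postOffice.bindingVisible.await(); // consumer arrives once the binding exists
                postOffice.consumerCreated(synchronizeOnPostOffice);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        queueCreator.start();
        consumerCreator.start();
        queueCreator.join();
        consumerCreator.join();
        return new ArrayList<>(postOffice.sent);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unsynchronized: " + run(false));
        System.out.println("synchronized:   " + run(true));
    }
}
```

With the flag set, the consumer thread cannot send until `addBinding()` has
released the monitor, so BINDING_ADDED is always recorded first in this toy
run. Without it, the order depends on timing and will usually be reversed here.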
If it were to happen again though, the changes made in
`RemoteQueueBindingImpl` should stop the redistributor from causing any issues
regardless, but I'd much rather understand everything that's going on here for
sure, if nothing else to be able to write a better reproducer for this...
Issue Time Tracking
-------------------
Worklog Id: (was: 894820)
Time Spent: 50m (was: 40m)
> Redistributor race when consumerCount reaches 0 in cluster
> ----------------------------------------------------------
>
> Key: ARTEMIS-4527
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4527
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Reporter: Anton Roskvist
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> This is a very rare bug caused by cluster notifications arriving in the wrong
> order in some very specific circumstances
--
This message was sent by Atlassian Jira
(v8.20.10#820010)