[
https://issues.apache.org/jira/browse/ARTEMIS-4527?focusedWorklogId=894308&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-894308
]
ASF GitHub Bot logged work on ARTEMIS-4527:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 06/Dec/23 14:20
Start Date: 06/Dec/23 14:20
Worklog Time Spent: 10m
Work Description: AntonRoskvist opened a new pull request, #4705:
URL: https://github.com/apache/activemq-artemis/pull/4705
…ster
This is a very rare bug, but when triggered, the redistributors loop the
messages of a queue with 0 consumers between some or all brokers in the
cluster as fast as they can manage, until either some system resource is
exhausted or the cluster bridges' producerFlowControl limit is reached. This
keeps happening until consumers are added or the cluster bridges are restarted.
I don't have a test for this, but I added a reproducer that works with a
considerable number of tweaks. Comments in the reproducer explain how to run
it. The reproducer is _not_ a valid or reasonable use case; it builds on some
unrelated work I did that accidentally triggered this. I have seen this
multiple times in a production environment over the course of several years,
though. I had just been unable to reproduce it outside of production before
accidentally stumbling on it recently.
The problem occurs when the CONSUMER_CREATED notification arrives before the
BINDING_ADDED notification.
When that happens, the consumerCount for the RemoteBinding is incorrect
(something like 1-2 lower than the actual consumerCount value).
Then, when the consumers disconnect, every disconnect is registered properly
and the RemoteBinding ends up with a negative consumerCount. The
`isHighAcceptPriority` check used by the redistributor tests for
consumerCount == 0, but since the count is now negative, the binding is
treated as a valid destination.
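To illustrate the sequence described above, here is a minimal, self-contained sketch. It is not Artemis code: `ToyPostOffice`, `ToyRemoteBinding` and `acceptsRedistribution` are hypothetical stand-ins, and it assumes that a CONSUMER_CREATED notification for a binding that does not exist yet is simply not counted.

```java
import java.util.HashMap;
import java.util.Map;

final class NotificationOrderSketch {

    // Hypothetical stand-in for a remote queue binding (not the Artemis class).
    static final class ToyRemoteBinding {
        int consumerCount;

        // Simplified version of the "== 0" check described above:
        // only a count of exactly zero is rejected.
        boolean acceptsRedistribution() {
            return consumerCount != 0;
        }
    }

    // Hypothetical stand-in for the component that applies cluster notifications.
    static final class ToyPostOffice {
        private final Map<String, ToyRemoteBinding> bindings = new HashMap<>();

        void onBindingAdded(String queue) {
            bindings.put(queue, new ToyRemoteBinding());
        }

        // Assumption: a notification for an unknown binding is lost, not queued.
        void onConsumerCreated(String queue) {
            ToyRemoteBinding binding = bindings.get(queue);
            if (binding != null) {
                binding.consumerCount++;
            }
        }

        void onConsumerClosed(String queue) {
            ToyRemoteBinding binding = bindings.get(queue);
            if (binding != null) {
                binding.consumerCount--;
            }
        }

        ToyRemoteBinding binding(String queue) {
            return bindings.get(queue);
        }
    }

    // Replays two consumers where the first CONSUMER_CREATED beats BINDING_ADDED.
    static ToyRemoteBinding replayOutOfOrder() {
        ToyPostOffice postOffice = new ToyPostOffice();
        postOffice.onConsumerCreated("q"); // arrives too early, not counted
        postOffice.onBindingAdded("q");
        postOffice.onConsumerCreated("q"); // counted, consumerCount is 1
        postOffice.onConsumerClosed("q");  // consumerCount is 0
        postOffice.onConsumerClosed("q");  // consumerCount goes negative
        return postOffice.binding("q");
    }

    public static void main(String[] args) {
        ToyRemoteBinding binding = replayOutOfOrder();
        System.out.println("consumerCount=" + binding.consumerCount
                + " acceptsRedistribution=" + binding.acceptsRedistribution());
    }
}
```

In this toy model the binding has no consumers left, yet it still passes the check because its count is below zero rather than equal to zero.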
The fix adds synchronization on the postoffice when processing createConsumer,
so that the previously issued addBinding is guaranteed to have completed
before continuing.
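The ordering idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual change in the PR: `SynchronizedPostOfficeSketch` and its methods are hypothetical, and it assumes that addBinding holds the post office monitor for the whole registration, so a createConsumer that takes the same monitor cannot run until the binding exists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

final class SynchronizedPostOfficeSketch {
    private final Map<String, Integer> consumerCounts = new HashMap<>();
    final List<String> events = new ArrayList<>();

    // Assumption: the binding is registered while holding the post office monitor.
    // The sleep only makes the registration slow enough to expose the ordering.
    synchronized void addBinding(String queue, CountDownLatch insideMonitor)
            throws InterruptedException {
        insideMonitor.countDown();   // signal that the monitor is now held
        Thread.sleep(50);            // sleeping does not release the monitor
        consumerCounts.put(queue, 0);
        events.add("BINDING_ADDED");
    }

    // Taking the same monitor forces this to wait for an in-flight addBinding.
    void createConsumer(String queue) {
        synchronized (this) {
            Integer count = consumerCounts.get(queue);
            if (count == null) {
                events.add("CONSUMER_DROPPED");
                return;
            }
            consumerCounts.put(queue, count + 1);
            events.add("CONSUMER_CREATED");
        }
    }

    synchronized int consumerCount(String queue) {
        return consumerCounts.getOrDefault(queue, 0);
    }

    // Starts a slow addBinding on another thread, then creates a consumer.
    static SynchronizedPostOfficeSketch run() {
        SynchronizedPostOfficeSketch postOffice = new SynchronizedPostOfficeSketch();
        CountDownLatch insideMonitor = new CountDownLatch(1);
        Thread binder = new Thread(() -> {
            try {
                postOffice.addBinding("q", insideMonitor);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            binder.start();
            insideMonitor.await();          // addBinding is in progress
            postOffice.createConsumer("q"); // blocks until addBinding finishes
            binder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return postOffice;
    }

    public static void main(String[] args) {
        System.out.println(run().events);
    }
}
```

Without the `synchronized (this)` block in `createConsumer`, the consumer would be processed while the binding is still being registered and would be dropped in this toy model.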
I also added a second safety net in the RemoteQueueBinding by not lowering
consumerCount below 0 and by checking for consumerCount <= 0 instead of
consumerCount == 0, though neither of these should really be necessary if the
cluster notifications always arrive in the correct order.
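The two defensive changes could look roughly like the sketch below. `GuardedBindingSketch` and its method names are hypothetical and only mirror the description above; the real RemoteQueueBinding code may differ.

```java
final class GuardedBindingSketch {
    private int consumerCount;

    synchronized void addConsumer() {
        consumerCount++;
    }

    // Guard 1: never lower the count below 0, even on a surplus close.
    synchronized void removeConsumer() {
        if (consumerCount > 0) {
            consumerCount--;
        }
    }

    // Guard 2: reject on "<= 0" instead of "== 0", so that a negative
    // count (should one ever occur) is not treated as a valid destination.
    synchronized boolean acceptsRedistribution() {
        return !(consumerCount <= 0);
    }

    synchronized int consumerCount() {
        return consumerCount;
    }

    public static void main(String[] args) {
        GuardedBindingSketch binding = new GuardedBindingSketch();
        binding.removeConsumer(); // surplus close, count stays at 0
        System.out.println("consumerCount=" + binding.consumerCount()
                + " acceptsRedistribution=" + binding.acceptsRedistribution());
    }
}
```

Either guard alone would stop the loop in this toy model; having both means a miscounted binding cannot attract redistributed messages.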
If anyone can figure out a consistent way to trigger this issue, I'd be happy
to add a test for it. Regardless, if the changes look good otherwise, I think
the reproducer should be removed rather than merged. I am leaving it here for
verification purposes.
One final thought: perhaps all of the create/add/remove operations in
`org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler`
should be synchronized?
Something building on the current pattern of:
`onMessagePacket()`
```
switch
   fast1:
      fast1Stuff();
   fast2:
      fast2Stuff();
   default:
      slow()

slow:
   switch
      slow1:
         slow1Stuff();
      slow2:
         slow2Stuff();
      default:
         synchronizedStuff()

synchronizedStuff:
   switch
      ...
      ...
```
Issue Time Tracking
-------------------
Worklog Id: (was: 894308)
Remaining Estimate: 0h
Time Spent: 10m
> Redistributor race when consumerCount reaches 0 in cluster
> ----------------------------------------------------------
>
> Key: ARTEMIS-4527
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4527
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Reporter: Anton Roskvist
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is a very rare bug caused by cluster notifications arriving in the wrong
> order in some very specific circumstances
--
This message was sent by Atlassian Jira
(v8.20.10#820010)