[
https://issues.apache.org/jira/browse/AMQ-9107?focusedWorklogId=823752&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-823752
]
ASF GitHub Bot logged work on AMQ-9107:
---------------------------------------
Author: ASF GitHub Bot
Created on: 06/Nov/22 16:12
Start Date: 06/Nov/22 16:12
Worklog Time Spent: 10m
Work Description: cshannon commented on PR #928:
URL: https://github.com/apache/activemq/pull/928#issuecomment-1304835694
@lucastetreault - The main issue here was the memory leak with durable
subscriptions closing and coming back online with new consumers. That could
of course have been fixed, and your original fix probably would have been
fine to keep. But beyond the memory leak, I went with a different approach
for one main reason I thought was important, plus a secondary one.
1. The primary reason I went with the new approach is that my implementation
avoids tracking duplicate metadata. The information is already tracked in the
`ManagedRegionBroker` (since it extends `RegionBroker`) via the different
`Region` objects, and that code has been well tested for a long time, so it
made more sense to me to just use the existing maps. That way we don't have
to worry about the extra logic of maintaining new maps/metadata and verifying
they stay in sync with what is already there and tested. I saw zero benefit
in tracking something again that we already track and can easily look up, and
only drawbacks: one more thing to keep in sync and update, and the risk of
exactly the kind of memory leak that happened.
2. A more minor, secondary point is that you assumed there is always a 1-to-1
mapping of `ConsumerInfo` to subscriptions. I think this is minor because
that is usually the case (and I'd have to check, but I believe it always is
now), even with composite destinations, which is why the composite
destination test works since it shares a subscription. In theory, though, it
doesn't have to be, which is why the original code was looking through all
subscriptions for a matching `ConsumerInfo`, and my approach preserves that
by finding all matching subscriptions across the different `Region`s. The
`Region` implementations use a 1-to-1 mapping of consumer id to subscription,
but the new code still checks each region, so if the code in the different
regions ever associated more than one subscription with a consumer, the new
approach would still work.
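The lookup described above can be sketched roughly as follows. This is a
minimal, self-contained illustration using plain Java maps, not ActiveMQ's
actual API: `findMatchingSubscriptions`, `REGIONS`, and all other names here
are hypothetical stand-ins for the per-region subscription maps the broker
already maintains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (hypothetical names): reuse the per-region maps the broker already
// keeps, instead of maintaining a duplicate consumer -> subscription map.
public class RegionScanSketch {

    // each region maps consumer id -> subscription name (1-to-1 today)
    static final Map<String, Map<String, String>> REGIONS = new HashMap<>();

    // scan every region, so even if more than one region held an entry for
    // the same consumer id, all matching subscriptions would still be found
    static List<String> findMatchingSubscriptions(String consumerId) {
        List<String> matches = new ArrayList<>();
        for (Map<String, String> regionSubs : REGIONS.values()) {
            String sub = regionSubs.get(consumerId);
            if (sub != null) {
                matches.add(sub);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> queueRegion = new HashMap<>();
        queueRegion.put("consumer-1", "queue-sub-1");
        Map<String, String> topicRegion = new HashMap<>();
        topicRegion.put("consumer-2", "topic-sub-2");
        REGIONS.put("queue", queueRegion);
        REGIONS.put("topic", topicRegion);

        System.out.println(findMatchingSubscriptions("consumer-1"));
        // prints [queue-sub-1]
    }
}
```

Because the lookup only reads maps that already exist, there is no second
copy of the data to keep in sync or to leak.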
As for the tests passing: the main issue was the memory leak, and my test
class of course doesn't try to check the map that leaked
(`consumerSubscriptionMap`), since it no longer exists in my PR. If I run my
test against your original fix and update the `testDurableConsumerRemoval`
test to verify that `consumerSubscriptionMap` has size zero at the end, after
all consumers close (which it should), it doesn't: the size is still 1000 and
the test fails. I assume it would of course be fine if you fixed that memory
leak.
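The failing check can be illustrated with a tiny standalone sketch. This uses
plain Java maps rather than the actual broker or test classes, and the leak
is simulated simply by skipping the removal on close; all names are
hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration of the leak check: a duplicate
// consumer -> subscription map must be empty once every consumer closes.
// Forgetting the removal on close is exactly the leak pattern described.
public class LeakCheckSketch {

    static int remainingAfterClose(boolean removeOnClose) {
        Map<String, String> consumerSubscriptionMap = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            consumerSubscriptionMap.put("consumer-" + i, "sub-" + i);
        }
        for (int i = 0; i < 1000; i++) {
            if (removeOnClose) {
                consumerSubscriptionMap.remove("consumer-" + i);
            }
            // a leaky implementation skips the removal, so entries pile up
        }
        return consumerSubscriptionMap.size();
    }

    public static void main(String[] args) {
        System.out.println(remainingAfterClose(true));  // prints 0
        System.out.println(remainingAfterClose(false)); // prints 1000
    }
}
```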
Issue Time Tracking
-------------------
Worklog Id: (was: 823752)
Time Spent: 3h 40m (was: 3.5h)
> Closing many consumers causes CPU to spike to 100%
> --------------------------------------------------
>
> Key: AMQ-9107
> URL: https://issues.apache.org/jira/browse/AMQ-9107
> Project: ActiveMQ
> Issue Type: Bug
> Affects Versions: 5.17.1, 5.16.5
> Reporter: Lucas Tétreault
> Assignee: Jean-Baptiste Onofré
> Priority: Major
> Fix For: 5.18.0, 5.16.6, 5.17.3
>
> Attachments: example.zip, image-2022-10-07-00-12-39-657.png,
> image-2022-10-07-00-17-30-657.png
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> When there are many consumers (~188k) on a queue, closing them is incredibly
> expensive and causes the CPU to spike to 100% while the consumers are closed.
> Tested on an Amazon MQ mq.m5.large instance (2 vCPU, 8 GB memory).
> I have attached a minimal recreation of the issue where the following
> happens:
> 1/ Open 100 connections.
> 2/ Create consumers as fast as we can on all of those connections until we
> hit at least 188k consumers.
> 3/ Sleep for 5 minutes so we can observe the CPU come back down after opening
> all those connections.
> 4/ Start closing consumers as fast as we can.
> 5/ After all consumers are closed, sleep for 5 minutes to observe the CPU
> come back down after closing all the connections.
>
> In this example it seems 5 minutes wasn't actually sufficient time for the
> CPU to come back down and the consumer and connection counts seem to hit 0 at
> the same time:
> !image-2022-10-07-00-12-39-657.png|width=757,height=353!
>
> In a previous test with more time sleeping after closing all the consumers we
> can see the CPU come back down before we close the connections.
> !image-2022-10-07-00-17-30-657.png|width=764,height=348!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)