[
https://issues.apache.org/jira/browse/AMQ-9107?focusedWorklogId=823752&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-823752
]
ASF GitHub Bot logged work on AMQ-9107:
---------------------------------------
Author: ASF GitHub Bot
Created on: 06/Nov/22 16:12
Start Date: 06/Nov/22 16:12
Worklog Time Spent: 10m
Work Description: cshannon commented on PR #928:
URL: https://github.com/apache/activemq/pull/928#issuecomment-1304835694
@lucastetreault - The main issue here was the memory leak with durable
subscriptions closing and coming back online with new consumers. That could
of course have been fixed, and your original fix probably would have been
fine to keep. But beyond the memory leak, I went with a different approach
for one main reason I thought was important, plus a secondary one.
1. The primary reason I went with the new approach is that my implementation
avoids tracking duplicate metadata. The information is already tracked in the
`ManagedRegionBroker` (since it extends `RegionBroker`) via the different
`Region` objects, and that code has been well tested for a long time, so it
made more sense to me to just use the existing maps. That way we don't have
to worry about the extra logic of maintaining new maps/metadata and verifying
they stay in sync with what is already there and tested. I saw zero benefit
in tracking something again that we already track and can easily look up, and
only drawbacks: one more thing to keep in sync and update, and the risk of
exactly the kind of memory leak that happened.
2. A more minor, secondary point is that you assumed there is always a 1-to-1
mapping of `ConsumerInfo` to subscriptions. I think this is minor because
that is usually the case (and I'd have to check, but I believe it always is
now), even with composite destinations, which is why the composite
destination test works since it shares a subscription. In theory, though, it
doesn't have to be, which is why the original code was looking through all
subscriptions for a matching `ConsumerInfo`, and my approach preserves that
by finding all matching subscriptions across the different `Region`s. The
`Region` implementations use a 1-to-1 mapping of consumer id to subscription,
but the new code still checks each region, so if the code in the different
regions ever associated more than one subscription with a consumer, the new
approach would still work.
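The lookup described above can be sketched roughly as follows. This is a
minimal, self-contained illustration using plain Java maps, not ActiveMQ's
actual API: `findMatchingSubscriptions`, `REGIONS`, and all other names here
are hypothetical stand-ins for the per-region subscription maps the broker
already maintains.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (hypothetical names): reuse the per-region maps the broker already
// keeps, instead of maintaining a duplicate consumer -> subscription map.
public class RegionScanSketch {

    // each region maps consumer id -> subscription name (1-to-1 today)
    static final Map<String, Map<String, String>> REGIONS = new HashMap<>();

    // scan every region, so even if more than one region held an entry for
    // the same consumer id, all matching subscriptions would still be found
    static List<String> findMatchingSubscriptions(String consumerId) {
        List<String> matches = new ArrayList<>();
        for (Map<String, String> regionSubs : REGIONS.values()) {
            String sub = regionSubs.get(consumerId);
            if (sub != null) {
                matches.add(sub);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> queueRegion = new HashMap<>();
        queueRegion.put("consumer-1", "queue-sub-1");
        Map<String, String> topicRegion = new HashMap<>();
        topicRegion.put("consumer-2", "topic-sub-2");
        REGIONS.put("queue", queueRegion);
        REGIONS.put("topic", topicRegion);

        System.out.println(findMatchingSubscriptions("consumer-1"));
        // prints [queue-sub-1]
    }
}
```

Because the lookup only reads maps that already exist, there is no second
copy of the data to keep in sync or to leak.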
As for the tests passing: the main issue was the memory leak, and my test
class of course doesn't try to check the map that leaked
(`consumerSubscriptionMap`), since it no longer exists in my PR. If I run my
test against your original fix and update the `testDurableConsumerRemoval`
test to verify that `consumerSubscriptionMap` has size zero at the end, after
all consumers close (which it should), it doesn't: the size is still 1000 and
the test fails. I assume it would of course be fine if you fixed that memory
leak.
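The failing check can be illustrated with a tiny standalone sketch. This uses
plain Java maps rather than the actual broker or test classes, and the leak
is simulated simply by skipping the removal on close; all names are
hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration of the leak check: a duplicate
// consumer -> subscription map must be empty once every consumer closes.
// Forgetting the removal on close is exactly the leak pattern described.
public class LeakCheckSketch {

    static int remainingAfterClose(boolean removeOnClose) {
        Map<String, String> consumerSubscriptionMap = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            consumerSubscriptionMap.put("consumer-" + i, "sub-" + i);
        }
        for (int i = 0; i < 1000; i++) {
            if (removeOnClose) {
                consumerSubscriptionMap.remove("consumer-" + i);
            }
            // a leaky implementation skips the removal, so entries pile up
        }
        return consumerSubscriptionMap.size();
    }

    public static void main(String[] args) {
        System.out.println(remainingAfterClose(true));  // prints 0
        System.out.println(remainingAfterClose(false)); // prints 1000
    }
}
```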
Issue Time Tracking
-------------------
Worklog Id: (was: 823752)
Time Spent: 3h 40m (was: 3.5h)
> Closing many consumers causes CPU to spike to 100%
> --------------------------------------------------
>
> Key: AMQ-9107
> URL: https://issues.apache.org/jira/browse/AMQ-9107
> Project: ActiveMQ
> Issue Type: Bug
> Affects Versions: 5.17.1, 5.16.5
> Reporter: Lucas Tétreault
> Assignee: Jean-Baptiste Onofré
> Priority: Major
> Fix For: 5.18.0, 5.16.6, 5.17.3
>
> Attachments: example.zip, image-2022-10-07-00-12-39-657.png,
> image-2022-10-07-00-17-30-657.png
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> When there are many consumers (~188k) on a queue, closing them is incredibly
> expensive and causes the CPU to spike to 100% while the consumers are closed.
> Tested on an Amazon MQ mq.m5.large instance (2 vCPU, 8 GB memory).
> I have attached a minimal recreation of the issue where the following
> happens:
> 1/ Open 100 connections.
> 2/ Create consumers as fast as we can on all of those connections until we
> hit at least 188k consumers.
> 3/ Sleep for 5 minutes so we can observe the CPU come back down after opening
> all those connections.
> 4/ Start closing consumers as fast as we can.
> 5/ After all consumers are closed, sleep for 5 minutes to observe the CPU
> come back down after closing all the connections.
>
> In this example it seems 5 minutes wasn't actually sufficient time for the
> CPU to come back down and the consumer and connection counts seem to hit 0 at
> the same time:
> !image-2022-10-07-00-12-39-657.png|width=757,height=353!
>
> In a previous test with more time sleeping after closing all the consumers we
> can see the CPU come back down before we close the connections.
> !image-2022-10-07-00-17-30-657.png|width=764,height=348!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)