[ 
https://issues.apache.org/jira/browse/ARTEMIS-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy A. Bish resolved ARTEMIS-5325.
--------------------------------------
    Fix Version/s: 2.41.0
       Resolution: Fixed

> Don't block session creation/closing with sending management notification
> -------------------------------------------------------------------------
>
>                 Key: ARTEMIS-5325
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-5325
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker, Clustering
>    Affects Versions: 2.36.0, 2.37.0, 2.38.0, 2.39.0
>            Reporter: Jean-Pascal Briquet
>            Assignee: Justin Bertram
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.41.0
>
>         Attachments: PrimaryDeadLockOnBackupSyncTest.java, 
> thread-dump-consumer-events.txt, thread-dump.txt
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> h2. Configuration
> Artemis cluster with three primary/backup pairs using a ZooKeeper quorum.
> h2. Description
> The initial primary/backup replication can impact the primary (live) node, 
> causing it to crash or freeze for an extended period.
> After an in-depth investigation, I found that the primary becomes dead-locked 
> because no Netty threads are available to process the replication 
> synchronization confirmation coming from the backup.
> This issue occurs when client applications create too many connections during 
> the final phase of replication.
> Below, I provide details of my investigation and a potential workaround.
> A thread-dump and a test-case are attached.
> h3. Lock / Unlock
> At the very end of the replication process, the Artemis primary locks its 
> internal state, including the journal (see 
> ReplicationManager.sendSynchronizationDone()).
> It then waits for a synchronization confirmation packet from the backup 
> before releasing the lock (see ReplicationManager.handlePacket()).
> This confirmation packet indicates to the primary that the backup is 
> synchronized and ready for duty.
> While locked, the primary is essentially frozen: no operation can proceed on 
> the broker.
> Under normal circumstances, this lock lasts only a few seconds or less.
> However, in my scenario, the confirmation packet from the backup is never 
> processed.
> As a result, the primary remains locked indefinitely, freezing all activity 
> until the replication process times out or the Artemis critical analyzer 
> decides to stop the process.
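
The lock-and-wait step described above can be sketched roughly as follows. This is an illustration only, not Artemis code: the class SyncLockSketch and its methods are hypothetical names, and a plain read/write lock plus a latch stand in for the broker's internal state lock and the confirmation packet.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration; none of these names come from Artemis.
final class SyncLockSketch {
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    private final CountDownLatch backupConfirmed = new CountDownLatch(1);

    // Stands in for the handler that processes the backup's confirmation packet.
    void onConfirmation() {
        backupConfirmed.countDown();
    }

    // Holds the write lock until the confirmation arrives or the timeout expires.
    // Returns true only if the confirmation was processed in time.
    boolean finishSynchronization(long timeoutMillis) {
        stateLock.writeLock().lock();
        try {
            return backupConfirmed.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            stateLock.writeLock().unlock();
        }
    }

    // Stands in for any broker operation; it fails while another thread holds the lock.
    boolean tryOperation() {
        if (!stateLock.readLock().tryLock()) {
            return false;
        }
        stateLock.readLock().unlock();
        return true;
    }
}
```

If onConfirmation() is never called, finishSynchronization() keeps the lock for the full timeout, which mirrors the freeze described in this section.
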
> h3. Confirmation packet handling issue
> All incoming packets arriving at Artemis are handled by Netty threads, which 
> are managed via a dedicated Netty thread pool of size = 3 * processor count.
> After adding low-level logs in packet handlers and analyzing TCP dumps, I'm 
> sure that the confirmation packet is received by the primary but is 
> never processed.
> Upon inspecting the thread-dump, it is possible to see that no free Artemis 
> Netty threads are available.
> All Netty threads are blocked handling connection creation requests while 
> attempting to send session notification events to other cluster nodes. 
> However, such notification events cannot be sent due to the replication and 
> journal lock.
> During the investigation, I saw that some client applications were 
> misbehaving, aggressively creating new connections.
> When these excessive connection requests occur in the final phase of the 
> initial replication, they can block all Netty threads, leading to the 
> deadlock.
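
The starvation pattern described in this section can be reproduced in miniature with a fixed-size thread pool. The sketch below is a simplification under stated assumptions: a plain ExecutorService stands in for the Netty thread pool, the class PoolStarvationSketch and its parameters are hypothetical, and each "session request" simply blocks until the task that represents the confirmation packet has run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of pool starvation; not Artemis or Netty code.
final class PoolStarvationSketch {

    // Returns true if the confirmation task ran within waitMillis.
    static boolean confirmationProcessed(int poolSize, int sessionRequests, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch lockReleased = new CountDownLatch(1);
        try {
            // Each session request blocks until the lock is released,
            // like a notification send stuck behind the replication lock.
            for (int i = 0; i < sessionRequests; i++) {
                pool.execute(() -> {
                    try {
                        lockReleased.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The confirmation is the only task that can release the lock,
            // but it needs a free pool thread to run.
            pool.execute(lockReleased::countDown);
            return lockReleased.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupt any tasks that are still blocked
        }
    }
}
```

With as many blocked requests as pool threads, for example confirmationProcessed(2, 2, 200), the confirmation stays queued and the call times out. With one spare thread, for example confirmationProcessed(3, 2, 2000), the confirmation runs and everything unblocks.
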
> h2. Workaround
> Enable the following configuration in the broker.xml.
> {quote}<suppress-session-notifications>true</suppress-session-notifications>
> {quote}
> This property disables session creation notifications, preventing Netty 
> threads from being blocked and therefore avoiding the deadlock.
> https://activemq.apache.org/components/artemis/documentation/latest/management.html#suppressing-session-notifications
> Disabling session notifications seems to be acceptable for my use cases, which 
> rely on the CORE, AMQP and OPENWIRE protocols.
> However, according to the documentation, this option should not be used with 
> the MQTT protocol.
> h2. Test
> Add the provided test under 
> tests/integration-tests/src/test/java/org/apache/activemq/artemis/tests/integration/cluster/failover



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
For further information, visit: https://activemq.apache.org/contact

