[
https://issues.apache.org/jira/browse/ARTEMIS-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timothy A. Bish resolved ARTEMIS-5325.
--------------------------------------
Fix Version/s: 2.41.0
Resolution: Fixed
> Don't block session creation/closing with sending management notification
> -------------------------------------------------------------------------
>
> Key: ARTEMIS-5325
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5325
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker, Clustering
> Affects Versions: 2.36.0, 2.37.0, 2.38.0, 2.39.0
> Reporter: Jean-Pascal Briquet
> Assignee: Justin Bertram
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.41.0
>
> Attachments: PrimaryDeadLockOnBackupSyncTest.java,
> thread-dump-consumer-events.txt, thread-dump.txt
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> h2. Configuration
> Artemis cluster with three primary/backup pairs using a ZooKeeper quorum.
> h2. Description
> The initial primary/backup replication can impact the primary (live) node,
> causing it to crash or freeze for an extended period.
> After an in-depth investigation, I found that the primary becomes dead-locked
> because no Netty threads are available to process the replication
> synchronization confirmation coming from the backup.
> This issue occurs when client applications create too many connections during
> the final phase of replication.
> Below, I provide details of my investigation and a potential workaround.
> A thread-dump and a test-case are attached.
> h3. Lock / Unlock
> At the very end of the replication process, the Artemis primary locks its
> internal state, including the journal (see
> ReplicationManager.sendSynchronizationDone()).
> It then waits for a synchronization confirmation packet from the backup
> before releasing the lock (see ReplicationManager.handlePacket()).
> This confirmation packet indicates to the primary that the backup is
> synchronized and ready for duty.
> While locked, the primary is essentially frozen: no operation can proceed on
> the broker.
> Under normal circumstances, this lock lasts only a few seconds or less.
> However, in my scenario, the confirmation packet from the backup is never
> processed.
> As a result, the primary remains locked indefinitely, freezing all activity
> until the replication process times out or the Artemis critical analyzer
> decides to stop the process.
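The lock-then-wait sequence described above can be sketched in plain Java. This is only an illustrative model under stated assumptions, not Artemis code: the class `SyncLockSketch` and the names `stateLock`, `confirmation`, `onConfirmationPacket` and `finishSynchronization` are hypothetical, and the real logic lives in ReplicationManager.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SyncLockSketch {
    // Hypothetical stand-in for the lock guarding broker state and journal.
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    // Counted down once the backup's confirmation packet has been processed.
    private final CountDownLatch confirmation = new CountDownLatch(1);

    /** Called by whichever thread handles the confirmation packet. */
    public void onConfirmationPacket() {
        confirmation.countDown();
    }

    /**
     * Locks the state, waits for the confirmation, and always releases the lock.
     * Returns true if the confirmation arrived in time, false on timeout.
     */
    public boolean finishSynchronization(long timeout, TimeUnit unit) {
        stateLock.writeLock().lock();
        try {
            // The "synchronization done" message would be sent to the backup here.
            return confirmation.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            stateLock.writeLock().unlock();
        }
    }
}
```

In this model, everything that needs `stateLock` stalls for as long as `onConfirmationPacket()` is not called, which mirrors the freeze reported here when the confirmation packet is never processed.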
> h3. Confirmation packet handling issue
> All incoming packets arriving at Artemis are handled by Netty threads, which
> are managed via a dedicated Netty thread pool of size = 3 * processor count.
> After adding low-level logs in the packet handlers and analyzing TCP dumps,
> I'm sure that the confirmation packet is received by the primary but is
> never processed.
> Upon inspecting the thread-dump, it is possible to see that no free Artemis
> Netty threads are available.
> All Netty threads are blocked handling connection creation requests while
> attempting to send session notification events to other cluster nodes.
> However, such notification events cannot be sent due to the replication and
> journal lock.
> During the investigation, I saw that some client applications were
> misbehaving, aggressively creating new connections.
> When these excessive connection requests occur in the final phase of the
> initial replication, they can block all Netty threads, leading to the
> deadlock.
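The starvation pattern described above can be reproduced in miniature with a plain fixed-size executor. This is a simplified sketch, not the attached test case and not how Netty event loops are actually scheduled: `PoolStarvationSketch` and `confirmationProcessed` are hypothetical names, the latch stands in for the replication/journal lock, and each blocking task stands in for a session request stuck sending its notification.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolStarvationSketch {

    /**
     * Fills a fixed pool with tasks that block until "unlocked", then queues the
     * single task that would unlock them. Returns true if that confirmation task
     * ran within the timeout, false if the pool was starved.
     */
    public static boolean confirmationProcessed(int poolSize, int sessionRequests, long timeoutMillis) {
        ExecutorService ioPool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch unlocked = new CountDownLatch(1);
        try {
            for (int i = 0; i < sessionRequests; i++) {
                ioPool.execute(() -> {
                    try {
                        unlocked.await(); // blocked until the lock is released
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The confirmation handler: the only task able to release the lock.
            ioPool.execute(unlocked::countDown);
            return unlocked.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            ioPool.shutdownNow(); // interrupt any workers that are still blocked
        }
    }

    public static void main(String[] args) {
        // Same sizing rule as quoted in the description: 3 * processor count.
        int poolSize = 3 * Runtime.getRuntime().availableProcessors();
        System.out.println("pool fully blocked: " + confirmationProcessed(poolSize, poolSize, 200));
        System.out.println("one thread free:    " + confirmationProcessed(poolSize, poolSize - 1, 200));
    }
}
```

When every worker is occupied by a blocked request, the confirmation task stays in the queue and the call returns false; leaving a single worker free is enough for it to run. Keeping blocking work off the I/O threads avoids the first case.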
> h2. Workaround
> Enable the following configuration in the broker.xml.
> {quote}<suppress-session-notifications>true</suppress-session-notifications>
> {quote}
> This property disables session creation notifications, preventing Netty
> threads from being blocked and therefore avoiding the deadlock.
> https://activemq.apache.org/components/artemis/documentation/latest/management.html#suppressing-session-notifications
> Disabling session notifications seems to be acceptable for my use cases, which
> rely on the CORE, AMQP and OPENWIRE protocols.
> However, according to the documentation, this option should not be used with
> the MQTT protocol.
> h2. Test
> Add the provided test under
> tests/integration-tests/src/test/java/org/apache/activemq/artemis/tests/integration/cluster/failover