[ 
https://issues.apache.org/jira/browse/ARTEMIS-5325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17941615#comment-17941615
 ] 

Justin Bertram commented on ARTEMIS-5325:
-----------------------------------------

[~jpbriquet], it is expected that the call to 
{{https://activemq.apache.org/components/artemis/documentation/latest/ha.html#replication}}
 will block any journal related operations (e.g. paging). This is specifically 
called out in [the 
documentation|https://activemq.apache.org/components/artemis/documentation/latest/ha.html#replication].
 That said, it should unblock via a timeout long before the critical analyzer 
kicks in. The default value of {{initial-replication-sync-timeout}} is 
{{30000}} (i.e. 30 seconds). In this situation it would be worth investigating 
why the backup isn't responding, although there's only so much you can glean 
from a single thread dump. You really need several thread dumps to see what's 
happening _over time_. It's possible that replication synchronization is a red 
herring and the real problem lies elsewhere. I don't actually see anything in 
the thread dump that would be registered with the critical analyzer. Did you 
happen to gather any thread dumps of your own during this time?
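
As a rough illustration of what "several thread dumps over time" can look like, here is a minimal Java sketch that samples thread states at a fixed interval using the standard {{java.lang.management}} API. The class and method names ({{ThreadDumpSampler}}, {{dump}}, {{sample}}) are hypothetical and not part of Artemis. As written, the sketch only samples the JVM it runs in; for a separate broker process, a tool such as {{jstack}} run repeatedly against the broker's PID would serve the same purpose.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Artemis: samples thread dumps of the current JVM.
final class ThreadDumpSampler {

    // Captures the name, state, and stack of every live thread as text.
    static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip monitor/synchronizer details to keep the sketch portable.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Takes 'count' dumps, sleeping 'intervalMillis' between consecutive dumps.
    static List<String> sample(int count, long intervalMillis) throws InterruptedException {
        List<String> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            dumps.add(dump());
            if (i < count - 1) {
                Thread.sleep(intervalMillis);
            }
        }
        return dumps;
    }

    public static void main(String[] args) throws InterruptedException {
        // Three dumps, one second apart; compare them to see which threads stay blocked.
        List<String> dumps = sample(3, 1000);
        for (int i = 0; i < dumps.size(); i++) {
            System.out.println("=== dump " + (i + 1) + " ===");
            System.out.println(dumps.get(i));
        }
    }
}
```

Threads that show the same blocked stack in every sample are the ones worth a closer look.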

There is still a bit of a mystery as to why management notifications are being 
paged to disk. Perhaps your broker exceeded its {{global-max-size}}?

> Don't block session creation/closing with sending management notification
> -------------------------------------------------------------------------
>
>                 Key: ARTEMIS-5325
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-5325
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker, Clustering
>    Affects Versions: 2.36.0, 2.37.0, 2.38.0, 2.39.0
>            Reporter: Jean-Pascal Briquet
>            Assignee: Justin Bertram
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: PrimaryDeadLockOnBackupSyncTest.java, 
> thread-dump-consumer-events.txt, thread-dump.txt
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h2. Configuration
> Artemis cluster with three primary/backup pairs using a ZooKeeper quorum.
> h2. Description
> The initial primary/backup replication can impact the primary (live) node, 
> causing it to crash or freeze for an extended period.
> After an in-depth investigation, I found that the primary becomes dead-locked 
> because no Netty threads are available to process the replication 
> synchronization confirmation coming from the backup.
> This issue occurs when client applications create too many connections 
> during the final phase of replication.
> Below, I provide details of my investigation and a potential workaround.
> A thread-dump and a test-case are attached.
> h3. Lock / Unlock
> At the very end of the replication process, the Artemis primary locks its 
> internal state, including the journal (see 
> ReplicationManager.sendSynchronizationDone()).
> It then waits for a synchronization confirmation packet from the backup 
> before releasing the lock (see ReplicationManager.handlePacket()).
> This confirmation packet indicates to the primary that the backup is 
> synchronized and ready for duty.
> While locked, the primary is essentially frozen; no operation can proceed 
> on the broker.
> Under normal circumstances, this lock lasts only a few seconds or less.
> However, in my scenario, the confirmation packet from the backup is never 
> processed.
> As a result, the primary remains locked indefinitely, freezing all activity 
> until the replication process times out or the Artemis critical analyzer 
> decides to stop the process.
> h3. Confirmation packet handling issue
> All incoming packets arriving at Artemis are handled by Netty threads, which 
> are managed via a dedicated Netty thread pool of size = 3 * processor count.
> After adding low-level logs in the packet handlers and analyzing TCP dumps, 
> I'm sure that the confirmation packet is received by the primary but is 
> never processed.
> Upon inspecting the thread dump, it is possible to see that no free Artemis 
> Netty threads are available.
> All Netty threads are blocked handling connection creation requests while 
> attempting to send session notification events to other cluster nodes. 
> However, such notification events cannot be sent due to the replication and 
> journal lock.
> During the investigation, I saw that some client applications were 
> misbehaving, aggressively creating new connections.
> When these excessive connection requests occur in the final phase of the 
> initial replication, they can block all Netty threads, leading to the 
> deadlock.
> h2. Workaround
> Enable the following configuration in the broker.xml.
> {quote}<suppress-session-notifications>true</suppress-session-notifications>
> {quote}
> This property disables session creation notifications, preventing Netty 
> threads from being blocked and therefore avoiding the deadlock.
> https://activemq.apache.org/components/artemis/documentation/latest/management.html#suppressing-session-notifications
> Disabling session notifications seems to be acceptable for my use cases, 
> which rely on the CORE, AMQP and OPENWIRE protocols.
> However, according to the documentation, this option should not be used with 
> the MQTT protocol.
> h2. Test
> Add the provided test under 
> tests/integration-tests/src/test/java/org/apache/activemq/artemis/tests/integration/cluster/failover
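
The starvation pattern described in the quoted issue (every pool thread blocked behind a lock that only a still-queued task can release) can be reproduced in isolation. The following is a minimal sketch and not Artemis code: a plain fixed-size {{ExecutorService}} stands in for the Netty thread pool, a {{CountDownLatch}} stands in for the replication/journal lock, and all names ({{PoolStarvationSketch}}, {{confirmationProcessed}}) are hypothetical. The wait with a timeout loosely plays the role that {{initial-replication-sync-timeout}} plays on the broker.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone model of the reported deadlock; not Artemis code.
final class PoolStarvationSketch {

    // Returns true if the "confirmation" task ran before the timeout,
    // false if it was starved by blocked "connection" tasks.
    static boolean confirmationProcessed(int poolSize, int connectionRequests, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize); // stands in for the Netty pool
        CountDownLatch lockReleased = new CountDownLatch(1);           // stands in for the replication lock
        CountDownLatch confirmationHandled = new CountDownLatch(1);
        try {
            // Connection-creation work: each task blocks until the lock is released,
            // like a handler stuck sending a session notification.
            for (int i = 0; i < connectionRequests; i++) {
                pool.submit(() -> {
                    try {
                        lockReleased.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Confirmation from the backup: the only task that releases the lock,
            // but it must be run by the same pool.
            pool.submit(() -> {
                lockReleased.countDown();
                confirmationHandled.countDown();
            });
            // Give up after the timeout instead of waiting forever.
            return confirmationHandled.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            lockReleased.countDown(); // timeout path: release the lock so blocked tasks can finish
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Every pool thread is blocked, so the confirmation is never processed: prints false.
        System.out.println("pool=4, requests=4: " + confirmationProcessed(4, 4, 500));
        // One thread stays free, so the confirmation is processed: prints true.
        System.out.println("pool=4, requests=3: " + confirmationProcessed(4, 3, 500));
    }
}
```

In this model the deadlock disappears as soon as one thread is not tied up by blocking work. This is consistent with the workaround above, which removes the blocking notification from the connection-handling path.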



