[jira] [Commented] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

Bruce Dodson (Jira) Fri, 13 Aug 2021 14:07:05 -0700


    [ 
https://issues.apache.org/jira/browse/AMQNET-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398916#comment-17398916
 ]


Bruce Dodson commented on AMQNET-656:
-------------------------------------

In follow-up, as per discussion in the PR, it appears the existing code does 
not violate rules for locks with async code after all, and could not be the 
explanation of the stall I observed. Thus, in theory, the Task.Run wrapper 
should not be needed. At best it is recognized as a brute force workaround that 
may or may not have made any difference.

I also noted that a bug in the underlying AMQPNetLite, now fixed as of version 
2.4.3, has been identified as the probable cause of the stall, due to dropped 
events to trigger failover / recovery: 
[Azure/amqpnetlite#460|https://github.com/Azure/amqpnetlite/issues/460] 
"NotifyClosed not being called on connection loss detected by HeartBeat..."

If so, the issue should no longer occur once 1.8.2 is released, as it will then 
reference AMQPNetLite 2.4.3+. Likewise, in theory, even before 1.8.2 is 
released, a workaround could be to reference AMQPNetLite 2.4.3 explicitly from 
NuGet.

I was not able to construct a test that would consistently reproduce the 
original issue. However, I will be switching one of our long-running test 
environments back from openwire to AMQP, with AMQPNetLite added as an explicit 
dependency and updated to 2.4.3, and hopefully will not see the stall again.

 

> AMQP failover implementation fails to reconnect in some cases
> -------------------------------------------------------------
>
>                 Key: AMQNET-656
>                 URL: https://issues.apache.org/jira/browse/AMQNET-656
>             Project: ActiveMQ .Net
>          Issue Type: Bug
>          Components: AMQP
>    Affects Versions: AMQP-1.8.1
>            Reporter: Bruce Dodson
>            Priority: Major
>          Time Spent: 8h
>  Remaining Estimate: 0h
>
> We recently had an issue where some of our producer instances were able to 
> reconnect immediately after the master/slave (or primary/standby) broker 
> cluster failed over, while others never reconnected.
> It appears to be related to two existing JIRAs, AMQNET-624 (addressed in 
> GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed 
> without any changes).
> Regarding the bug identified and fixed in AMQNET-624, part of the original 
> solution was pulled back: where TriggerReconnectionAttempt was called via 
> Task.Run, to instead call it directly. The second issue, AMQNET-626 was to 
> raise concern about the unawaited task returned by TriggerReconnectionAttempt.
> I think perhaps calling from Task.Run may have been beneficial after all: it 
> ensured that TriggerReconnectionAttempt was running on a thread from the 
> thread pool. Otherwise, when TriggerReconnectionAttempt calls 
> ScheduleReconnect, and ScheduleReconnect does an await, that is occurring 
> from within a lock statement.
> As noted in MSDN, an await statement _cannot_ occur inside of a lock 
> statement. That includes anywhere in the call stack, as far as I understand. 
> If you do, it is not caught by the compiler, but can lead to failures e.g. 
> where the task being awaited never gets scheduled.
> Invoking TriggerReconnectionAttempt from a thread pool thread (or another 
> background thread) is one way to avoid this issue, and using Task.Run() might 
> be the easiest way, even though it may also raise eyebrows. Any performance 
> overhead of Task.Run() shouldn't be a factor, since it is only invoked upon 
> losing connection, not continuously.
> The call to Task.Run() could also be moved into TriggerReconnectionAttempt, 
> like so:
> {code:java}
> // this is invoked using Task.Run, to ensure it runs on a thread pool thread
> // in case this was invoked from inside a lock statement (which it is)
> return Task.Run(async () => await 
> reconnectControl.ScheduleReconnect(Reconnect));{code}
> It does still leave the issue identified in AMQNET-626, that the result is 
> not checked, but it resolves the failover failure caused by calling await 
> inside of a lock.
> (Another way around this would be to use a SemaphoreSlim, or other 
> async-compatible synchronization mechanism instead of a lock statement. 
> However, that could have far-reaching implications, since lock statements are 
> used in many parts of the AMQP implementation.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

Reply via email to