[
https://issues.apache.org/jira/browse/AMQNET-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398916#comment-17398916
]
Bruce Dodson commented on AMQNET-656:
-------------------------------------
In follow-up, as per discussion in the PR, it appears the existing code does
not violate rules for locks with async code after all, and could not be the
explanation of the stall I observed. Thus, in theory, the Task.Run wrapper
should not be needed. At best it is recognized as a brute force workaround that
may or may not have made any difference.
I also noted that a bug in the underlying AMQPNetLite, now fixed as of version
2.4.3, has been identified as the probable cause of the stall, due to dropped
events to trigger failover / recovery:
[Azure/amqpnetlite#460|https://github.com/Azure/amqpnetlite/issues/460]
"NotifyClosed not being called on connection loss detected by HeartBeat..."
If so, the issue should no longer occur once 1.8.2 is released, as it will then
reference AMQPNetLite 2.4.3+. Likewise, in theory, even before 1.8.2 is
released, a workaround could be to reference AMQPNetLite 2.4.3 explicitly from
NuGet.
I was not able to construct a test that would consistently reproduce the
original issue. However, I will be switching one of our long-running test
environments back from openwire to AMQP, with AMQPNetLite added as an explicit
dependency and updated to 2.4.3, and hopefully will not see the stall again.
> AMQP failover implementation fails to reconnect in some cases
> -------------------------------------------------------------
>
> Key: AMQNET-656
> URL: https://issues.apache.org/jira/browse/AMQNET-656
> Project: ActiveMQ .Net
> Issue Type: Bug
> Components: AMQP
> Affects Versions: AMQP-1.8.1
> Reporter: Bruce Dodson
> Priority: Major
> Time Spent: 8h
> Remaining Estimate: 0h
>
> We recently had an issue where some of our producer instances were able to
> reconnect immediately after the master/slave (or primary/standby) broker
> cluster failed over, while others never reconnected.
> It appears to be related to two existing JIRAs, AMQNET-624 (addressed in
> GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed
> without any changes).
> Regarding the bug identified and fixed in AMQNET-624, part of the original
> solution was pulled back: where TriggerReconnectionAttempt was called via
> Task.Run, to instead call it directly. The second issue, AMQNET-626 was to
> raise concern about the unawaited task returned by TriggerReconnectionAttempt.
> I think perhaps calling from Task.Run may have been beneficial after all: it
> ensured that TriggerReconnectionAttempt was running on a thread from the
> thread pool. Otherwise, when TriggerReconnectionAttempt calls
> ScheduleReconnect, and ScheduleReconnect does an await, that is occurring
> from within a lock statement.
> As noted in MSDN, an await statement _cannot_ occur inside of a lock
> statement. That includes anywhere in the call stack, as far as I understand.
> If you do, it is not caught by the compiler, but can lead to failures e.g.
> where the task being awaited never gets scheduled.
> Invoking TriggerReconnectionAttempt from a thread pool thread (or another
> background thread) is one way to avoid this issue, and using Task.Run() might
> be the easiest way, even though it may also raise eyebrows. Any performance
> overhead of Task.Run() shouldn't be a factor, since it is only invoked upon
> losing connection, not continuously.
> The call to Task.Run() could also be moved into TriggerReconnectionAttempt,
> like so:
> {code:java}
> // this is invoked using Task.Run, to ensure it runs on a thread pool thread
> // in case this was invoked from inside a lock statement (which it is)
> return Task.Run(async () => await
> reconnectControl.ScheduleReconnect(Reconnect));{code}
> It does still leave the issue identified in AMQNET-626, that the result is
> not checked, but it resolves the failover failure caused by calling await
> inside of a lock.
> (Another way around this would be to use a SemaphoreSlim, or other
> async-compatible synchronization mechanism instead of a lock statement.
> However, that could have far-reaching implications, since lock statements are
> used in many parts of the AMQP implementation.)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)