[jira] [Work logged] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

ASF GitHub Bot (Jira) Tue, 16 Feb 2021 15:02:07 -0800


     [ 
https://issues.apache.org/jira/browse/AMQNET-656?focusedWorklogId=553238&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-553238
 ]


ASF GitHub Bot logged work on AMQNET-656:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Feb/21 23:01
            Start Date: 16/Feb/21 23:01
    Worklog Time Spent: 10m 
      Work Description: Havret commented on pull request #59:
URL: https://github.com/apache/activemq-nms-amqp/pull/59#issuecomment-780173091


   I'm not sure if this solves anything. Currently you will leave lock on first 
await. With the proposed fix you will leave it instantly. In either case lock 
doesn't work as it should. I'm not sure why 
https://issues.apache.org/jira/browse/AMQNET-626 was closed, so I reopened it. 
   
   As I said in https://github.com/apache/activemq-nms-amqp/pull/45 
   
   > I think we should revisit this part of code (mainly 
TriggerReconnectionAttempt) after we hopefully deliver the initial release. 
Calling async method without await or GetResult() is always a red flag.
   
   This was a bit too naively ported from Java and definitely requires a second 
look. A more reliable implementation might look as follows: 
https://github.com/Havret/dotnet-activemq-artemis-client/blob/master/src/ActiveMQ.Artemis.Client/AutoRecovering/AutoRecoveringConnection.cs


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 553238)
    Time Spent: 1.5h  (was: 1h 20m)

> AMQP failover implementation fails to reconnect in some cases
> -------------------------------------------------------------
>
>                 Key: AMQNET-656
>                 URL: https://issues.apache.org/jira/browse/AMQNET-656
>             Project: ActiveMQ .Net
>          Issue Type: Bug
>          Components: AMQP
>    Affects Versions: AMQP-1.8.1
>            Reporter: Bruce Dodson
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> We recently had an issue where some of our producer instances were able to 
> reconnect immediately after the master/slave (or primary/standby) broker 
> cluster failed over, while others never reconnected.
> It appears to be related to two existing JIRAs, AMQNET-624 (addressed in 
> GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed 
> without any changes).
> Regarding the bug identified and fixed in AMQNET-624, part of the original 
> solution was pulled back: where TriggerReconnectionAttempt was called via 
> Task.Run, to instead call it directly. The second issue, AMQNET-626 was to 
> raise concern about the unawaited task returned by TriggerReconnectionAttempt.
> I think perhaps calling from Task.Run may have been beneficial after all: it 
> ensured that TriggerReconnectionAttempt was running on a thread from the 
> thread pool. Otherwise, when TriggerReconnectionAttempt calls 
> ScheduleReconnect, and ScheduleReconnect does an await, that is occurring 
> from within a lock statement.
> As noted in MSDN, an await statement _cannot_ occur inside of a lock 
> statement. That includes anywhere in the call stack, as far as I understand. 
> If you do, it is not caught by the compiler, but can lead to failures e.g. 
> where the task being awaited never gets scheduled.
> Invoking TriggerReconnectionAttempt from a thread pool thread (or another 
> background thread) is one way to avoid this issue, and using Task.Run() might 
> be the easiest way, even though it may also raise eyebrows. Any performance 
> overhead of Task.Run() shouldn't be a factor, since it is only invoked upon 
> losing connection, not continuously.
> The call to Task.Run() could also be moved into TriggerReconnectionAttempt, 
> like so:
> {code:java}
> // this is invoked using Task.Run, to ensure it runs on a thread pool thread
> // in case this was invoked from inside a lock statement (which it is)
> return Task.Run(async () => await 
> reconnectControl.ScheduleReconnect(Reconnect));{code}
> It does still leave the issue identified in AMQNET-626, that the result is 
> not checked, but it resolves the failover failure caused by calling await 
> inside of a lock.
> (Another way around this would be to use a SemaphoreSlim, or other 
> async-compatible synchronization mechanism instead of a lock statement. 
> However, that could have far-reaching implications, since lock statements are 
> used in many parts of the AMQP implementation.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

Reply via email to