[jira] [Work logged] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

ASF GitHub Bot (Jira) Wed, 21 Apr 2021 11:34:04 -0700


     [ 
https://issues.apache.org/jira/browse/AMQNET-656?focusedWorklogId=586795&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586795
 ]


ASF GitHub Bot logged work on AMQNET-656:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Apr/21 18:33
            Start Date: 21/Apr/21 18:33
    Worklog Time Spent: 10m 
      Work Description: brudo commented on pull request #59:
URL: https://github.com/apache/activemq-nms-amqp/pull/59#issuecomment-824271798


   I tried to see if I could understand this better, including digging into 
AmqpNetLite. In FailoverProvider, the underlying provider and its listener are 
cleared before calling provider.Close(), while Amqp.AmqpObject.NotifyClosed 
protects against calling back more than once from the same instance, so it's 
probably safe from triggering a second call to HandleProviderError there.
   
   In any case my initial approach with adding a call to Task.Run may be more 
heavy-handed than it needed to be, and might be in the wrong place for all we 
know. I notice there is not a single Task.Run, other than the one I added, and 
the ones in the tests.
   
   One approach that doesn't use Task.Run might be something like this: have a 
dedicated background thread, that waits for a manual reset event that is set 
when a reconnect is required. I don't know if that is better, though. It 
guarantees only one reconnect at a time, and doesn't tie up a thread from the 
thread pool, and doesn't require blocking the AsyncPump, but uses a kernel 
handle and an idle thread per failover provider instance.
   
   The next step though should be to come up with a test that confirms the 
issue... hopefully the test will point to the right way to fix it, and if it's 
a quick fix that stands up to review, then perhaps it will make it into the 2.0 
release. I'll be looking into that this afternoon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 586795)
    Time Spent: 7h 20m  (was: 7h 10m)

> AMQP failover implementation fails to reconnect in some cases
> -------------------------------------------------------------
>
>                 Key: AMQNET-656
>                 URL: https://issues.apache.org/jira/browse/AMQNET-656
>             Project: ActiveMQ .Net
>          Issue Type: Bug
>          Components: AMQP
>    Affects Versions: AMQP-1.8.1
>            Reporter: Bruce Dodson
>            Priority: Major
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> We recently had an issue where some of our producer instances were able to 
> reconnect immediately after the master/slave (or primary/standby) broker 
> cluster failed over, while others never reconnected.
> It appears to be related to two existing JIRAs, AMQNET-624 (addressed in 
> GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed 
> without any changes).
> Regarding the bug identified and fixed in AMQNET-624, part of the original 
> solution was pulled back: where TriggerReconnectionAttempt was called via 
> Task.Run, to instead call it directly. The second issue, AMQNET-626 was to 
> raise concern about the unawaited task returned by TriggerReconnectionAttempt.
> I think perhaps calling from Task.Run may have been beneficial after all: it 
> ensured that TriggerReconnectionAttempt was running on a thread from the 
> thread pool. Otherwise, when TriggerReconnectionAttempt calls 
> ScheduleReconnect, and ScheduleReconnect does an await, that is occurring 
> from within a lock statement.
> As noted in MSDN, an await statement _cannot_ occur inside of a lock 
> statement. That includes anywhere in the call stack, as far as I understand. 
> If you do, it is not caught by the compiler, but can lead to failures e.g. 
> where the task being awaited never gets scheduled.
> Invoking TriggerReconnectionAttempt from a thread pool thread (or another 
> background thread) is one way to avoid this issue, and using Task.Run() might 
> be the easiest way, even though it may also raise eyebrows. Any performance 
> overhead of Task.Run() shouldn't be a factor, since it is only invoked upon 
> losing connection, not continuously.
> The call to Task.Run() could also be moved into TriggerReconnectionAttempt, 
> like so:
> {code:java}
> // this is invoked using Task.Run, to ensure it runs on a thread pool thread
> // in case this was invoked from inside a lock statement (which it is)
> return Task.Run(async () => await 
> reconnectControl.ScheduleReconnect(Reconnect));{code}
> It does still leave the issue identified in AMQNET-626, that the result is 
> not checked, but it resolves the failover failure caused by calling await 
> inside of a lock.
> (Another way around this would be to use a SemaphoreSlim, or other 
> async-compatible synchronization mechanism instead of a lock statement. 
> However, that could have far-reaching implications, since lock statements are 
> used in many parts of the AMQP implementation.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

Reply via email to