[jira] [Work logged] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

ASF GitHub Bot (Jira) Mon, 22 Feb 2021 01:14:06 -0800


     [ 
https://issues.apache.org/jira/browse/AMQNET-656?focusedWorklogId=555713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-555713
 ]


ASF GitHub Bot logged work on AMQNET-656:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Feb/21 09:13
            Start Date: 22/Feb/21 09:13
    Worklog Time Spent: 10m 
      Work Description: lukeabsent edited a comment on pull request #59:
URL: https://github.com/apache/activemq-nms-amqp/pull/59#issuecomment-783215340


   Had a little look, and it seems overall that there could be more Reconnect() 
going 
   at the same time than just one (which I believe was the original intention), 
and that is one risk. 
   If you suspect deadlock happening around HandleProviderError-lock(SyncRoot) 
then maybe 
   it is being possible cause there is provider.Close inside - though deeper 
inside it seems to be timeouted for 1m.
   Other risk is that HandleProviderError looks like it may be called from 
underlying AsyncPump (AmqpNetLite) thread (inline execution) and if it locks or 
waits on something, then pump stops pumping and if thing being expected is 
something that was supposed to be delivered by AsyncPump, then there is 
deadlock.
   In general, it looks like it requires some work, it would be good to be able 
to reproduce it in some test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 555713)
    Time Spent: 5h 50m  (was: 5h 40m)

> AMQP failover implementation fails to reconnect in some cases
> -------------------------------------------------------------
>
>                 Key: AMQNET-656
>                 URL: https://issues.apache.org/jira/browse/AMQNET-656
>             Project: ActiveMQ .Net
>          Issue Type: Bug
>          Components: AMQP
>    Affects Versions: AMQP-1.8.1
>            Reporter: Bruce Dodson
>            Priority: Major
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> We recently had an issue where some of our producer instances were able to 
> reconnect immediately after the master/slave (or primary/standby) broker 
> cluster failed over, while others never reconnected.
> It appears to be related to two existing JIRAs, AMQNET-624 (addressed in 
> GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed 
> without any changes).
> Regarding the bug identified and fixed in AMQNET-624, part of the original 
> solution was pulled back: where TriggerReconnectionAttempt was called via 
> Task.Run, to instead call it directly. The second issue, AMQNET-626 was to 
> raise concern about the unawaited task returned by TriggerReconnectionAttempt.
> I think perhaps calling from Task.Run may have been beneficial after all: it 
> ensured that TriggerReconnectionAttempt was running on a thread from the 
> thread pool. Otherwise, when TriggerReconnectionAttempt calls 
> ScheduleReconnect, and ScheduleReconnect does an await, that is occurring 
> from within a lock statement.
> As noted in MSDN, an await statement _cannot_ occur inside of a lock 
> statement. That includes anywhere in the call stack, as far as I understand. 
> If you do, it is not caught by the compiler, but can lead to failures e.g. 
> where the task being awaited never gets scheduled.
> Invoking TriggerReconnectionAttempt from a thread pool thread (or another 
> background thread) is one way to avoid this issue, and using Task.Run() might 
> be the easiest way, even though it may also raise eyebrows. Any performance 
> overhead of Task.Run() shouldn't be a factor, since it is only invoked upon 
> losing connection, not continuously.
> The call to Task.Run() could also be moved into TriggerReconnectionAttempt, 
> like so:
> {code:java}
> // this is invoked using Task.Run, to ensure it runs on a thread pool thread
> // in case this was invoked from inside a lock statement (which it is)
> return Task.Run(async () => await 
> reconnectControl.ScheduleReconnect(Reconnect));{code}
> It does still leave the issue identified in AMQNET-626, that the result is 
> not checked, but it resolves the failover failure caused by calling await 
> inside of a lock.
> (Another way around this would be to use a SemaphoreSlim, or other 
> async-compatible synchronization mechanism instead of a lock statement. 
> However, that could have far-reaching implications, since lock statements are 
> used in many parts of the AMQP implementation.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (AMQNET-656) AMQP failover implementation fails to reconnect in some cases

Reply via email to