[ https://issues.apache.org/jira/browse/AMQNET-656?focusedWorklogId=555713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-555713 ]
ASF GitHub Bot logged work on AMQNET-656: ----------------------------------------- Author: ASF GitHub Bot Created on: 22/Feb/21 09:13 Start Date: 22/Feb/21 09:13 Worklog Time Spent: 10m Work Description: lukeabsent edited a comment on pull request #59: URL: https://github.com/apache/activemq-nms-amqp/pull/59#issuecomment-783215340 Had a little look, and it seems overall that there could be more Reconnect() going at the same time than just one (which I believe was the original intention), and that is one risk. If you suspect deadlock happening around HandleProviderError-lock(SyncRoot) then maybe it is being possible cause there is provider.Close inside - though deeper inside it seems to be timeouted for 1m. Other risk is that HandleProviderError looks like it may be called from underlying AsyncPump (AmqpNetLite) thread (inline execution) and if it locks or waits on something, then pump stops pumping and if thing being expected is something that was supposed to be delivered by AsyncPump, then there is deadlock. In general, it looks like it requires some work, it would be good to be able to reproduce it in some test. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking ------------------- Worklog Id: (was: 555713) Time Spent: 5h 50m (was: 5h 40m) > AMQP failover implementation fails to reconnect in some cases > ------------------------------------------------------------- > > Key: AMQNET-656 > URL: https://issues.apache.org/jira/browse/AMQNET-656 > Project: ActiveMQ .Net > Issue Type: Bug > Components: AMQP > Affects Versions: AMQP-1.8.1 > Reporter: Bruce Dodson > Priority: Major > Time Spent: 5h 50m > Remaining Estimate: 0h > > We recently had an issue where some of our producer instances were able to > reconnect immediately after the master/slave (or primary/standby) broker > cluster failed over, while others never reconnected. > It appears to be related to two existing JIRAs, AMQNET-624 (addressed in > GitHub PR#45) and AMQNET-626 (new issue raised in the same PR, but closed > without any changes). > Regarding the bug identified and fixed in AMQNET-624, part of the original > solution was pulled back: where TriggerReconnectionAttempt was called via > Task.Run, to instead call it directly. The second issue, AMQNET-626 was to > raise concern about the unawaited task returned by TriggerReconnectionAttempt. > I think perhaps calling from Task.Run may have been beneficial after all: it > ensured that TriggerReconnectionAttempt was running on a thread from the > thread pool. Otherwise, when TriggerReconnectionAttempt calls > ScheduleReconnect, and ScheduleReconnect does an await, that is occurring > from within a lock statement. > As noted in MSDN, an await statement _cannot_ occur inside of a lock > statement. That includes anywhere in the call stack, as far as I understand. > If you do, it is not caught by the compiler, but can lead to failures e.g. > where the task being awaited never gets scheduled. > Invoking TriggerReconnectionAttempt from a thread pool thread (or another > background thread) is one way to avoid this issue, and using Task.Run() might > be the easiest way, even though it may also raise eyebrows. Any performance > overhead of Task.Run() shouldn't be a factor, since it is only invoked upon > losing connection, not continuously. > The call to Task.Run() could also be moved into TriggerReconnectionAttempt, > like so: > {code:java} > // this is invoked using Task.Run, to ensure it runs on a thread pool thread > // in case this was invoked from inside a lock statement (which it is) > return Task.Run(async () => await > reconnectControl.ScheduleReconnect(Reconnect));{code} > It does still leave the issue identified in AMQNET-626, that the result is > not checked, but it resolves the failover failure caused by calling await > inside of a lock. > (Another way around this would be to use a SemaphoreSlim, or other > async-compatible synchronization mechanism instead of a lock statement. > However, that could have far-reaching implications, since lock statements are > used in many parts of the AMQP implementation.) -- This message was sent by Atlassian Jira (v8.3.4#803005)