[ https://issues.apache.org/jira/browse/ARTEMIS-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478786#comment-17478786 ]
Marcelo Takeshi Fukushima commented on ARTEMIS-3622:
----------------------------------------------------

I think I've found the problem, but unfortunately I cannot write a consistently failing test case. Basically it goes like this:

1. Thread-1 starts a connection, which creates a new MQTTSession with a default MQTTSessionState, but then stalls or starves (either way, it pauses for a bit).

2. The failure-detector thread kicks in and decides that the connection created by Thread-1 has failed*, so it initiates MQTTConnectionManager.disconnect. There it synchronizes on the MQTTSessionState of that MQTTConnection, which is still the MQTTSessionState.DEFAULT instance (clientId == null). It then proceeds to clean up and enters MQTTSession.stop, a synchronized method. At this point the thread holds both the MQTTSessionState.DEFAULT and MQTTSession monitors, but before it can go any further it gets swapped out.

3. Thread-1 resumes and, in MQTTConnectionManager.connect, gets either a brand-new MQTTSessionState or an existing one from the MQTTProtocolManager. It then synchronizes on this new MQTTSessionState and updates the MQTTSession with the 'correct' MQTTSessionState. Finally, it tries to start the session by calling MQTTSession.start, which is synchronized but cannot proceed, since the failure-detector thread still holds the MQTTSession monitor. Thread-1 is now blocked while holding the monitor of the MQTTSessionState that sits inside the MQTTSession.

4. The failure-detector thread resumes MQTTSession.stop and eventually calls MQTTSessionState.clear, a synchronized method. But that MQTTSessionState is no longer the instance whose monitor it already holds (Thread-1 swapped it), so it blocks waiting for Thread-1, and we have our deadlock.

* - This is the part I'm unsure about: I don't know why the failure detector kicks in for a freshly created connection. Either way, I think the deadlock is a bug.

Now, I can see some ways this could be fixed:

1 - Mark all mutator methods on MQTTSession as synchronized.
That would, in theory, protect against Thread-1 swapping the MQTTSessionState while the failure detector holds the session. This looks weird to me, but no weirder than the current version, because a MQTTSession shared by multiple threads can already be mutated in inconsistent ways (each mutator method can advance at its own pace).

2 - Mark MQTTConnectionManager.connect and disconnect as synchronized. This effectively makes connection and disconnection mutually exclusive. It looks better to me, but I don't know the implications - it may create more deadlocks down the road.

3 - Only create the MQTTSession after the 'correct' MQTTSessionState has been created / retrieved, and maybe mark the field as final for good measure. This is the most disruptive change (it touches a large number of classes), but it looks the most 'correct' to me (again, I'm not familiar with the code, so this is just my guess).

I can provide a pull request for 1 and 2, and I can give 3 a shot if you all think it would be better.

Thanks a bunch
takeshi

> MQTT can deadlock on client connection / disconnection
> ------------------------------------------------------
>
> Key: ARTEMIS-3622
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3622
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: MQTT
> Affects Versions: 2.19.0
> Environment: Using the latest java 17 and artemis 2.19 but looking at
> the code, it should affect 2.20 as well.
> Reporter: Marcelo Takeshi Fukushima
> Priority: Major
>
> It seems that the {{MQTTProtocolHandler}} and {{MQTTConnectionManager}} are
> in a race condition and can deadlock each other on misbehaving clients. I'm
> including the relevant stack trace (ignore thread 11, which is just waiting for
> the lock).
> Looking at the relevant code, it seems that the clean-up thread (88 on the
> {{MQTTFailureListener}}) starts cleaning up the session state and then
> the session, but when {{MQTTSession.stop}} calls
> {{MQTTSessionState.clear}}, the session state is no longer the same (a
> racy connection has replaced the session state with a brand-new one under
> the same client-id).
>
> I think the methods connect and disconnect on the {{MQTTConnectionManager}}
> could be marked as synchronized as a whole, to prevent racy connects /
> disconnects (but since I don't know all the ins and outs of the code, you
> guys might have a better fix).
>
> {noformat}
> Found one Java-level deadlock:
> =============================
> "Thread-11 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   waiting to lock monitor 0x00007f6d003368c0 (object 0x000000045f29f240, a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState),
>   which is held by "Thread-24 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)"
> "Thread-24 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   waiting to lock monitor 0x00007f6d00336a80 (object 0x000000045f2a1068, a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession),
>   which is held by "Thread-88 (ActiveMQ-remoting-threads-ActiveMQServerImpl::name=0.0.0.0-212232499)"
> "Thread-88 (ActiveMQ-remoting-threads-ActiveMQServerImpl::name=0.0.0.0-212232499)":
>   waiting to lock monitor 0x00007f6d003368c0 (object 0x000000045f29f240, a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState),
>   which is held by "Thread-24 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)"
>
> Java stack information for the threads listed above:
> ===================================================
> "Thread-11 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnectionManager.disconnect(MQTTConnectionManager.java:150)
>   - waiting to lock <0x000000045f29f240> (a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTFailureListener.connectionFailed(MQTTFailureListener.java:37)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnection.fail(MQTTConnection.java:150)
>   at org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread$2.run(RemotingServiceImpl.java:780)
>   at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>   at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>   at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>   at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$137/0x0000000800e01dc8.run(Unknown Source)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.1/ThreadPoolExecutor.java:1136)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.1/ThreadPoolExecutor.java:635)
>   at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> "Thread-24 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession.start(MQTTSession.java:87)
>   - waiting to lock <0x000000045f2a1068> (a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnectionManager.connect(MQTTConnectionManager.java:111)
>   - locked <0x000000045f29f240> (a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTProtocolHandler.handleConnect(MQTTProtocolHandler.java:185)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTProtocolHandler.act(MQTTProtocolHandler.java:133)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTProtocolHandler$$Lambda$382/0x0000000801034000.onMessage(Unknown Source)
>   at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33)
>   at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>   at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$137/0x0000000800e01dc8.run(Unknown Source)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.1/ThreadPoolExecutor.java:1136)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.1/ThreadPoolExecutor.java:635)
>   at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> "Thread-88 (ActiveMQ-remoting-threads-ActiveMQServerImpl::name=0.0.0.0-212232499)":
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState.clear(MQTTSessionState.java:59)
>   - waiting to lock <0x000000045f29f240> (a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession.clean(MQTTSession.java:200)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession.stop(MQTTSession.java:115)
>   - locked <0x000000045f2a1068> (a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnectionManager.disconnect(MQTTConnectionManager.java:155)
>   - locked <0x00000002223bc2d0> (a org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTFailureListener.connectionFailed(MQTTFailureListener.java:37)
>   at org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnection.fail(MQTTConnection.java:150)
>   at org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl.issueFailure(RemotingServiceImpl.java:606)
>   at org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl.connectionDestroyed(RemotingServiceImpl.java:587)
>   at org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptor$Listener.connectionDestroyed(NettyAcceptor.java:951)
>   at org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler.lambda$channelInactive$0(ActiveMQChannelHandler.java:89)
>   at org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler$$Lambda$419/0x0000000801103b48.run(Unknown Source)
>   at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>   at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>   at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>   at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$137/0x0000000800e01dc8.run(Unknown Source)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.1/ThreadPoolExecutor.java:1136)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.1/ThreadPoolExecutor.java:635)
>   at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Found 1 deadlock.{noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
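The trace above is a classic ABBA lock inversion, made possible because the MQTTSessionState inside the session is swapped while another thread holds the session monitor. As a minimal sketch of option 2 from the comment (serializing connect and disconnect), the following self-contained Java snippet uses hypothetical stand-in classes (Session, SessionState, ConnectionManager - not the real Artemis types): with both entry points synchronized on the manager, monitors are only ever taken in the order manager -> session -> state, so the inverted wait cannot form.

```java
// Hypothetical simplified model of the MQTT connect/disconnect race,
// with option 2 applied: connect() and disconnect() are mutually exclusive.
public class ConnectRace {

    // Stand-in for MQTTSessionState: clear() is synchronized, like the real one.
    static class SessionState {
        synchronized void clear() { /* drop subscriptions, pending messages, ... */ }
    }

    // Stand-in for MQTTSession: start/stop are synchronized and touch the state.
    static class Session {
        private SessionState state = new SessionState();
        synchronized void setState(SessionState s) { state = s; }
        synchronized void start() { /* attach consumers, resume delivery */ }
        synchronized void stop()  { state.clear(); } // session monitor, then state monitor
    }

    // Option 2: connect and disconnect are serialized on the manager, so a
    // failure listener can never run stop() between the state swap and start().
    static class ConnectionManager {
        private final Session session = new Session();

        synchronized void connect(SessionState newState) {
            session.setState(newState); // swap in the "correct" state
            session.start();
        }

        synchronized void disconnect() {
            session.stop();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectionManager mgr = new ConnectionManager();
        // Model Thread-1 (connector) racing the failure-detector thread.
        Thread connector = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) mgr.connect(new SessionState());
        });
        Thread failureDetector = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) mgr.disconnect();
        });
        connector.start();
        failureDetector.start();
        connector.join(10_000);
        failureDetector.join(10_000);
        // If the ABBA wait could still form, one of the threads would hang here.
        if (connector.isAlive() || failureDetector.isAlive()) {
            throw new AssertionError("threads still blocked - deadlock");
        }
        System.out.println("no deadlock");
    }
}
```

Note that without the manager-level serialization, interleaving a state swap in connect() with a concurrent stop() reintroduces the inversion shown in the trace; that is the scenario options 1 and 3 attack from the other two directions.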