[ 
https://issues.apache.org/jira/browse/ARTEMIS-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478786#comment-17478786
 ] 

Marcelo Takeshi Fukushima commented on ARTEMIS-3622:
----------------------------------------------------

I think I've found the problem but unfortunately I cannot write a consistently 
failing test case. Basically it goes like this:

 

Thread-1 starts a connection, which creates a new MQTTSession with the default 
MQTTSessionState, but then stalls or starves (either way, it pauses for a bit).

The failure-detector-thread kicks in and decides that the connection created by 
Thread-1 has failed*, so it initiates MQTTConnectionManager.disconnect. There it 
synchronizes on the MQTTSessionState of that MQTTConnection, which is still 
the MQTTSessionState.DEFAULT instance (clientId == null). It then proceeds 
with the clean-up and enters MQTTSession.stop, a synchronized method. This thread 
now holds both the MQTTSessionState.DEFAULT and the MQTTSession monitors, but 
before it gets any further, it is swapped out.

Thread-1 'resumes' and, in MQTTConnectionManager.connect, gets either a 
brand new MQTTSessionState or an existing one from the MQTTProtocolManager. It 
then synchronizes on this new MQTTSessionState and updates the MQTTSession 
with the 'correct' MQTTSessionState. Finally, it tries to start the session 
by calling MQTTSession.start, which is synchronized and cannot proceed, because 
the failure-detector-thread still holds the monitor for the MQTTSession. 
Thread-1 is now blocked while holding the monitor for the MQTTSessionState that 
is inside the MQTTSession.

The failure-detector-thread resumes inside MQTTSession.stop and eventually 
calls MQTTSessionState.clear, a synchronized method. But the MQTTSessionState 
now inside the session is not the one whose monitor it already holds (Thread-1 
swapped it), so it blocks waiting for Thread-1 and we have our deadlock.

* - This is the part I'm unsure about: I don't know why the failure 
detector kicks in for a freshly created connection. Either way, I think the 
deadlock is a bug.
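
To make the cycle concrete, here is a minimal, self-contained sketch of the 
same lock inversion. Session and SessionState are simplified stand-ins I made 
up for this comment (not the actual Artemis classes), and the sleeps are only 
there to force one deadlocking interleaving reliably:

{code:java}
public class MqttDeadlockSketch {

   static class SessionState {
      synchronized void clear() { }                       // like MQTTSessionState.clear
   }

   static class Session {
      volatile SessionState state = new SessionState();   // the 'DEFAULT' state

      synchronized void stop() {                          // like MQTTSession.stop
         state.clear();                                   // needs the monitor of the *current* state
      }

      synchronized void start() { }                       // like MQTTSession.start
   }

   public static void main(String[] args) {
      Session session = new Session();
      SessionState defaultState = session.state;
      SessionState realState = new SessionState();

      // failure-detector-thread: disconnect() locks the old DEFAULT state,
      // then stop() takes the Session monitor and blocks on the new state
      Thread failureDetector = new Thread(() -> {
         synchronized (defaultState) {
            sleep(100);
            session.stop();              // holds Session, waits for realState
         }
      });

      // Thread-1: connect() locks the 'correct' state, swaps it into the
      // session, then blocks on the Session monitor - cycle complete
      Thread connecting = new Thread(() -> {
         synchronized (realState) {
            session.state = realState;
            sleep(200);
            session.start();             // waits for Session, held above
         }
      });

      failureDetector.start();
      connecting.start();
   }

   static void sleep(long ms) {
      try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
   }
}
{code}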

 

Now, I can see some ways this could be fixed (rough sketches of each option 
follow after the list):

1 - Mark all mutator methods on MQTTSession as synchronized. That would, in 
theory, protect against Thread-1 swapping the MQTTSessionState while the 
failure detector holds the session's monitor. This looks weird to me, but no 
weirder than the current version, because a MQTTSession shared by multiple 
threads can already be mutated in inconsistent ways (each mutator method can 
advance at its own pace).

2 - Mark MQTTConnectionManager.connect and disconnect as synchronized. This 
would effectively make connection and disconnection mutually exclusive. This 
looks better to me, but I don't know the implications - it may create more 
deadlocks down the road.

3 - Only create the MQTTSession after the 'correct' MQTTSessionState has been 
created / retrieved, and maybe mark the field final for good measure. This is 
the most disruptive change (it will touch a large number of classes), but it 
looks to me like the most 'correct' one (again, I'm not familiar with the 
code, so this is just my guess).
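
For illustration, here are rough sketches of each option, using the same 
simplified stand-ins as above (hypothetical names and signatures, not the 
real Artemis code).

Option 1, synchronized mutators on the session, so the state cannot be 
swapped while another thread is inside stop():

{code:java}
static class Session {
   private SessionState state = new SessionState();

   // every read and write of the state goes through the Session monitor
   synchronized void setSessionState(SessionState newState) {
      this.state = newState;
   }

   synchronized SessionState getSessionState() {
      return state;
   }

   synchronized void stop() {
      state.clear();
   }
}
{code}

Option 2, serializing connect / disconnect on the manager's own monitor 
(real class and method names, but parameter lists and bodies elided; note 
this only serializes calls that go through the same MQTTConnectionManager 
instance):

{code:java}
public class MQTTConnectionManager {

   synchronized void connect(/* CONNECT packet fields */) throws Exception {
      // ... look up or create the MQTTSessionState, then start the session
   }

   synchronized void disconnect(/* failure flag */) {
      // ... stop and clean up the session
   }
}
{code}

Option 3, fixing the state at construction time, so stop() always clears the 
state the session was created with:

{code:java}
static class Session {
   private final SessionState state;   // final: can never be swapped

   Session(SessionState state) {
      this.state = state;
   }

   synchronized void stop() {
      state.clear();
   }

   synchronized void start() { }
}

// connect() would then look roughly like (hypothetical helper name):
//   SessionState state = protocolManager.getOrCreateSessionState(clientId);
//   Session session = new Session(state);
//   session.start();
{code}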

 

I can provide a pull request for 1 and 2, and I can give 3 a shot if you all 
think it would be better.

Thanks a bunch

takeshi

> MQTT can deadlock on client connection / disconnection
> ------------------------------------------------------
>
>                 Key: ARTEMIS-3622
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3622
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: MQTT
>    Affects Versions: 2.19.0
>         Environment: Using the latest java 17 and artemis 2.19 but looking at 
> the code, it should affect 2.20 as well.
>            Reporter: Marcelo Takeshi Fukushima
>            Priority: Major
>
> It seems that the {{MQTTProtocolHandler}} and {{MQTTConnectionManager}} are 
> in a race condition and can deadlock on misbehaving clients. I'm 
> including the relevant stack trace (ignore thread 11, which is just waiting 
> for the lock).
> Looking at the relevant code, it seems that the clean-up thread (88 on the 
> {{MQTTFailureListener}}) starts cleaning up the session state and then 
> the session, but by the time {{MQTTSession.stop}} calls 
> {{MQTTSessionState.clear}}, the session state is no longer the same (a 
> racy connection has replaced the session state with a brand new one under 
> the same client-id).
> I think the methods connect and disconnect on the {{MQTTConnectionManager}} 
> could be marked as synchronized as a whole, to prevent racy connects / 
> disconnects (but since I don't know all the ins and outs of the code, you 
> guys might have a better fix).
> {noformat}
> Found one Java-level deadlock:
> =============================
> "Thread-11 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   waiting to lock monitor 0x00007f6d003368c0 (object 0x000000045f29f240, a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState),
>   which is held by "Thread-24 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)"
> "Thread-24 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   waiting to lock monitor 0x00007f6d00336a80 (object 0x000000045f2a1068, a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession),
>   which is held by "Thread-88 
> (ActiveMQ-remoting-threads-ActiveMQServerImpl::name=0.0.0.0-212232499)"
> "Thread-88 
> (ActiveMQ-remoting-threads-ActiveMQServerImpl::name=0.0.0.0-212232499)":
>   waiting to lock monitor 0x00007f6d003368c0 (object 0x000000045f29f240, a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState),
>   which is held by "Thread-24 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)"
> Java stack information for the threads listed above:
> ===================================================
> "Thread-11 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnectionManager.disconnect(MQTTConnectionManager.java:150)
>   - waiting to lock <0x000000045f29f240> (a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTFailureListener.connectionFailed(MQTTFailureListener.java:37)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnection.fail(MQTTConnection.java:150)
>   at 
> org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread$2.run(RemotingServiceImpl.java:780)
>   at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>   at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>   at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>   at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$137/0x0000000800e01dc8.run(Unknown
>  Source)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.1/ThreadPoolExecutor.java:1136)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.1/ThreadPoolExecutor.java:635)
>   at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> "Thread-24 
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@640f11a1)":
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession.start(MQTTSession.java:87)
>   - waiting to lock <0x000000045f2a1068> (a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnectionManager.connect(MQTTConnectionManager.java:111)
>   - locked <0x000000045f29f240> (a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTProtocolHandler.handleConnect(MQTTProtocolHandler.java:185)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTProtocolHandler.act(MQTTProtocolHandler.java:133)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTProtocolHandler$$Lambda$382/0x0000000801034000.onMessage(Unknown
>  Source)
>   at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33)
>   at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>   at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$137/0x0000000800e01dc8.run(Unknown
>  Source)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.1/ThreadPoolExecutor.java:1136)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.1/ThreadPoolExecutor.java:635)
>   at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> "Thread-88 
> (ActiveMQ-remoting-threads-ActiveMQServerImpl::name=0.0.0.0-212232499)":
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState.clear(MQTTSessionState.java:59)
>   - waiting to lock <0x000000045f29f240> (a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession.clean(MQTTSession.java:200)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession.stop(MQTTSession.java:115)
>   - locked <0x000000045f2a1068> (a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSession)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnectionManager.disconnect(MQTTConnectionManager.java:155)
>   - locked <0x00000002223bc2d0> (a 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTSessionState)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTFailureListener.connectionFailed(MQTTFailureListener.java:37)
>   at 
> org.apache.activemq.artemis.core.protocol.mqtt.MQTTConnection.fail(MQTTConnection.java:150)
>   at 
> org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl.issueFailure(RemotingServiceImpl.java:606)
>   at 
> org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl.connectionDestroyed(RemotingServiceImpl.java:587)
>   at 
> org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptor$Listener.connectionDestroyed(NettyAcceptor.java:951)
>   at 
> org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler.lambda$channelInactive$0(ActiveMQChannelHandler.java:89)
>   at 
> org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler$$Lambda$419/0x0000000801103b48.run(Unknown
>  Source)
>   at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>   at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>   at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>   at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$137/0x0000000800e01dc8.run(Unknown
>  Source)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.1/ThreadPoolExecutor.java:1136)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.1/ThreadPoolExecutor.java:635)
>   at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Found 1 deadlock.{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
