john lilley created ARTEMIS-4104:
------------------------------------
Summary: Services using RPC pattern can deadlock in intra-service
calls due to thread starvation
Key: ARTEMIS-4104
URL: https://issues.apache.org/jira/browse/ARTEMIS-4104
Project: ActiveMQ Artemis
Issue Type: Bug
Components: ActiveMQ-Artemis-Native
Affects Versions: 2.27.1
Environment: {code:java}
$ mvn --version
Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
Maven home: F:\bin\apache-maven
Java version: 17.0.3, vendor: Eclipse Adoptium, runtime: C:\Program
Files\Eclipse Adoptium\jdk-17.0.3.7-hotspot
Default locale: en_US, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows" {code}
Reporter: john lilley
Assignee: Clebert Suconic
Our application uses AMQ/Artemis as a resilient intermediary to implement
RPC-style synchronous services. Apart from creating the connection factory,
all of our calls go through the default JMS client. We don't do anything
sophisticated. Dozens of services implement hundreds of methods. These
services can call each other; most notably, many services call the security
service to validate tokens, check user permissions, etc.
We've recently transitioned from AMQ 5 to Artemis, and started noticing
timeouts under load. Analysis showed that these timeouts were due to a certain
kind of deadlock, in which all allocated AMQ client threads were servicing
requests and each was attempting to call another service. The Artemis client
doesn't allocate more threads under this condition, resulting in thread
starvation and a deadlock in which every thread waits on a response. This can
even happen immediately, if a single thread is allocated and its handler
attempts to call another service.
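The starvation pattern described above can be reproduced without a broker. The following is a minimal sketch using only java.util.concurrent, not the Artemis client itself: the class and method names are made up for illustration, and a fixed-size executor stands in for the client thread pool. A handler running on the pool submits a nested "call" to the same pool and blocks on the reply, with a timeout so the demo reports the starvation instead of hanging.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StarvationDemo {
    /**
     * Runs one handler on a pool of poolSize threads. The handler makes a
     * nested call on the SAME pool and blocks until the reply arrives or
     * timeoutMs elapses. (Hypothetical stand-in for a JMS request/reply.)
     */
    public static String nestedCall(int poolSize, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Future<String> outer = pool.submit(() -> {
                // The nested request needs a free pool thread to be served.
                Future<String> inner = pool.submit(() -> "reply");
                try {
                    return inner.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // No thread was free: the only worker is this one.
                    return "timeout";
                }
            });
            return outer.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("1 thread:  " + nestedCall(1, 500));
        System.out.println("2 threads: " + nestedCall(2, 500));
    }
}
```

With a single thread the nested call can only time out, because the one worker is the thread that is waiting; with a second thread available the reply arrives.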
In AMQ 5, it seemed that every MessageListener instance that was activated
created a new service thread, so this never happened. I can see why Artemis
uses dynamic thread pools, because this is more resource-efficient. However,
it seems to have a critical flaw.
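To illustrate the difference between the two threading models, here is a rough sketch in plain Java. It makes no claim about how either broker client is implemented: a cached executor that grows on demand is used as an approximation of the thread-per-listener behavior AMQ 5 appeared to have, and a fixed executor approximates a bounded pool. All names are hypothetical. Every handler waits until all handlers are running, then makes a nested call on the same pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolComparison {
    /** Returns how many of the handlers received a reply to their nested call. */
    public static int completedCalls(ExecutorService pool, int handlers, long timeoutMs)
            throws Exception {
        CountDownLatch allBusy = new CountDownLatch(handlers);
        CountDownLatch allDone = new CountDownLatch(handlers);
        List<Future<Boolean>> results = new ArrayList<>();
        try {
            for (int i = 0; i < handlers; i++) {
                results.add(pool.submit(() -> {
                    // Wait until every handler occupies a thread.
                    allBusy.countDown();
                    allBusy.await();
                    Future<String> inner = pool.submit(() -> "reply");
                    boolean replied;
                    try {
                        inner.get(timeoutMs, TimeUnit.MILLISECONDS);
                        replied = true;
                    } catch (TimeoutException e) {
                        replied = false;
                    }
                    // Hold this thread until all handlers have an outcome, so a
                    // freed thread cannot serve a slower peer's nested call.
                    allDone.countDown();
                    allDone.await();
                    return replied;
                }));
            }
            int ok = 0;
            for (Future<Boolean> r : results) {
                if (r.get()) ok++;
            }
            return ok;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("bounded pool: "
                + completedCalls(Executors.newFixedThreadPool(4), 4, 500) + " of 4 replied");
        System.out.println("growing pool: "
                + completedCalls(Executors.newCachedThreadPool(), 4, 500) + " of 4 replied");
    }
}
```

In this sketch the bounded pool starves as soon as the number of blocked handlers equals the pool size, while the growing pool trades that risk for an unbounded thread count.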
The attached project builds under Maven 3.8 and Java 17. When run, it randomly
decides whether a service handler should call another service. See the Parms
class for the settings that control the test:
{code:java}
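public class Parms {
    public static final int CLIENTS = 10;                    // concurrent client-test threads
    public static final int SERVERS = 20;                    // MessageListener instances
    public static final double RECURSION_PROBABILITY = 0.1;  // chance a handler calls another service
    public static final int ITERS = 10;
}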
{code}
These default settings tend to deadlock about 30% of the time. Increasing
RECURSION_PROBABILITY to 0.5 results in deadlock nearly every time. It is
notable that SERVERS (the number of MessageListener instances) is double
CLIENTS (the number of concurrent client-test threads), and yet the deadlock
still occurs, so it is pretty clear that we are not running out of _service
instances_. In production, services never call themselves; the call tree is
strictly a DAG. This test uses a single service class for simplicity, but that
makes no difference to the outcome.
I should also point out that I've tried to use globalThreadPools=false and set
up multiple connections, hoping each one would get its own thread pool. This
was a total failure. Services were not delivered any messages at all, no
matter what variants I tried.
At this point, our only options appear to be abandoning our Artemis migration
effort or refactoring our services into many processes, so that each one gets
its own thread pool and, we hope, the deadlock no longer occurs. Any help
towards a workaround is appreciated.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)