[
https://issues.apache.org/jira/browse/ARTEMIS-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
john lilley closed ARTEMIS-4104.
--------------------------------
Resolution: Cannot Reproduce
> Services using RPC pattern can deadlock in intra-service calls due to thread
> starvation
> ----------------------------------------------------------------------------------------
>
> Key: ARTEMIS-4104
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4104
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: ActiveMQ-Artemis-Native
> Affects Versions: 2.27.1
> Environment: {code:java}
> $ mvn --version
> Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
> Maven home: F:\bin\apache-maven
> Java version: 17.0.3, vendor: Eclipse Adoptium, runtime: C:\Program
> Files\Eclipse Adoptium\jdk-17.0.3.7-hotspot
> Default locale: en_US, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
> {code}
> Reporter: john lilley
> Assignee: Clebert Suconic
> Priority: Critical
> Attachments: amqtest.zip
>
>
> Our application uses AMQ/Artemis as a resilient intermediary to implement
> RPC-style synchronous services. All of our calls other than the factory are
> through the default JMS client. We don't do anything sophisticated. Dozens
> of services implement 100s of methods. These services can call each other,
> most notably a lot of services call security to validate tokens, check user
> permissions, etc.
> We've recently transitioned from AMQ 5 to Artemis, and started noticing
> timeouts (RPC clients failing to see a message in the reply-to queue) under
> load. Analysis showed that these timeouts were due to a certain kind of
> deadlock, where all allocated AMQ client threads were servicing requests and
> attempted to call other services. The Artemis client doesn't allocate more
> threads under this condition, resulting in thread starvation and deadlock
> waiting on responses. This can actually happen immediately, if a single
> thread is allocated and attempts to call another service.
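Inline note: the starvation described above can be reproduced without a broker. The following is a minimal sketch, not the Artemis client's actual threading code. It assumes a fixed-size `java.util.concurrent` pool as a stand-in for the client's delivery threads, and the names `StarvationDemo` and `nestedCallCompletes` are hypothetical. A handler running on the pool submits a nested request to the same pool and blocks waiting for the reply.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StarvationDemo {

    /**
     * Runs one "outer" request on a fixed pool of the given size. The outer
     * handler submits an "inner" request to the same pool and blocks waiting
     * for its reply. Returns true if the inner reply arrived before the timeout.
     */
    public static boolean nestedCallCompletes(int poolSize, long timeoutMillis) {
        // Assumption: a bounded pool models the client's delivery threads.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Future<Boolean> outer = pool.submit(() -> {
                // Nested "RPC": needs a free pool thread to produce the reply.
                Future<String> inner = pool.submit(() -> "reply");
                try {
                    inner.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return true;
                } catch (TimeoutException e) {
                    // No thread was free to run the inner request.
                    return false;
                }
            });
            return outer.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("pool of 1: " + nestedCallCompletes(1, 500));
        System.out.println("pool of 2: " + nestedCallCompletes(2, 500));
    }
}
```

With a single thread the inner request stays queued while the outer handler waits, so `main` prints `false` for the pool of 1 and `true` for the pool of 2. This only illustrates the general mechanism; it does not show that the Artemis client behaves this way.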
> In AMQ 5, it seemed that every MessageListener instance that was activated
> created a new service thread, so this never happened. I can see why Artemis
> uses dynamic thread pools, because this is more resource-efficient. However,
> it seems to have a critical flaw.
> The attached project builds under Maven 3.8 and Java 17. When run, it
> randomly decides if a service handler should call another service. See the
> Parms class for the controlling settings of the test:
> {code:java}
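> public class Parms {
>     public static final int CLIENTS = 10;  // concurrent client-test threads
>     public static final int SERVERS = 20;  // MessageListener instances
>     public static final double RECURSION_PROBABILITY = 0.1;  // chance a handler calls another service
>     public static final int ITERS = 10;
> }
> {code}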
> These default settings tend to deadlock about 30% of the time. Increasing
> RECURSION_PROBABILITY to 0.5 results in deadlock nearly every time. It is
> notable that SERVERS (the number of {{MessageListener}} instances) is double
> CLIENTS (the number of concurrent client-test threads), and yet the deadlock
> still occurs, so it is pretty clear that we are not running out of
> {{MessageListener}} instances. In production, services never call
> themselves; the call tree is strictly a DAG. This test uses a
> single service class for simplicity, but that makes no difference.
> I should also point out that I've tried to use globalThreadPools=false and
> set up multiple connections, hoping each one would get its own thread pool.
> This was a total failure. Services were not delivered any messages at all,
> no matter what variants I tried. But that's another bug, see ARTEMIS-4105.
> At this point, our only choices appear to be abandoning our Artemis
> migration effort or refactoring our services into many processes so that each
> one gets its own thread pool, which we hope would avoid the problem. Any help
> towards a workaround is appreciated.
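Inline note: one general pattern for this kind of starvation is to keep the listener callback short and hand the real work to an application-owned pool, so a blocked handler never holds a delivery thread. The sketch below is only a simulation under stated assumptions: a one-thread executor stands in for the client's delivery threads, and `HandoffDemo`, `call`, `onMessage`, `handle`, and `run` are hypothetical names, not Artemis or JMS APIs. Nothing in this issue confirms that the pattern resolves the reported behaviour.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class HandoffDemo {
    // Assumption: stand-in for a bounded set of client delivery threads.
    private final ExecutorService delivery = Executors.newFixedThreadPool(1);
    // Application-owned pool that grows on demand.
    private final ExecutorService workers = Executors.newCachedThreadPool();
    private final boolean handOff;

    HandoffDemo(boolean handOff) {
        this.handOff = handOff;
    }

    /** Simulates sending a request; the future completes when it is handled. */
    CompletableFuture<String> call(int depth) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        delivery.execute(() -> onMessage(depth, reply));
        return reply;
    }

    /** Simulated listener callback; always invoked on a delivery thread. */
    private void onMessage(int depth, CompletableFuture<String> reply) {
        if (handOff) {
            workers.execute(() -> handle(depth, reply)); // return immediately
        } else {
            handle(depth, reply); // blocks the delivery thread
        }
    }

    /** Service logic: optionally makes a nested call before replying. */
    private void handle(int depth, CompletableFuture<String> reply) {
        try {
            if (depth > 0) {
                call(depth - 1).get(500, TimeUnit.MILLISECONDS);
            }
            reply.complete("ok");
        } catch (Exception e) {
            reply.completeExceptionally(e);
        }
    }

    /** Returns true if a request with one nested call completes. */
    public static boolean run(boolean handOff) {
        HandoffDemo demo = new HandoffDemo(handOff);
        try {
            demo.call(1).get(5, TimeUnit.SECONDS);
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            demo.delivery.shutdownNow();
            demo.workers.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("inline handling: " + run(false));
        System.out.println("hand-off to workers: " + run(true));
    }
}
```

In this simulation `main` prints `false` for inline handling and `true` for the hand-off. In a real JMS service the hand-off has costs the sketch ignores: a JMS `Session` is meant to be used by one thread at a time, so the worker would need its own session to send the reply, and acknowledgement would have to account for work that finishes after the callback returns.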
> Finally, the RPC client can be switched between onMessage() and receive() for
> its reply-to queue here in the test:
> {code:java}
> public static RpcClient make() {
>     // return new RpcClientReceive();
>     return new RpcClientOnMessage();
> }
> {code}
> But it makes no difference.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)