[ 
https://issues.apache.org/jira/browse/ARTEMIS-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

john lilley closed ARTEMIS-4104.
--------------------------------
    Resolution: Cannot Reproduce

> Services using RPC pattern can deadlock in intra-service calls due to thread 
> starvation 
> ----------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-4104
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-4104
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: ActiveMQ-Artemis-Native
>    Affects Versions: 2.27.1
>         Environment: {code:java}
> $ mvn --version
> Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
> Maven home: F:\bin\apache-maven
> Java version: 17.0.3, vendor: Eclipse Adoptium, runtime: C:\Program 
> Files\Eclipse Adoptium\jdk-17.0.3.7-hotspot
> Default locale: en_US, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows" 
> {code}
>            Reporter: john lilley
>            Assignee: Clebert Suconic
>            Priority: Critical
>         Attachments: amqtest.zip
>
>
> Our application uses AMQ/Artemis as a resilient intermediary to implement 
> RPC-style synchronous services.  All of our calls other than the factory go 
> through the default JMS client.  We don't do anything sophisticated.  Dozens 
> of services implement hundreds of methods.  These services can call each 
> other; most notably, many services call the security service to validate 
> tokens, check user permissions, etc.
> We recently transitioned from AMQ 5 to Artemis and started noticing 
> timeouts (RPC clients failing to see a message in the reply-to queue) under 
> load.  Analysis showed that these timeouts were due to a certain kind of 
> deadlock: all allocated Artemis client threads were servicing requests and 
> attempting to call other services.  The Artemis client doesn't allocate more 
> threads under this condition, so every thread starves waiting on a response 
> that can never be produced.  This can even happen immediately, if only a 
> single thread is allocated and it attempts to call another service.
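> The failure mode can be reproduced outside Artemis entirely.  The sketch 
> below (my own minimal illustration, not code from the attached project) uses 
> a one-thread fixed pool to stand in for the client thread pool: the handler 
> dispatches a nested "call" to the same pool and blocks on it, so the nested 
> task can never start and the outer call times out, just like our RPC clients:

```java
import java.util.concurrent.*;

// Minimal illustration of pool-induced thread starvation: a handler running
// on the pool's only thread submits a nested task to the SAME pool and then
// blocks waiting for it. The nested task can never be scheduled, so the
// outer call times out -- analogous to the RPC timeouts described above.
public class StarvationDemo {
    static final ExecutorService pool = Executors.newFixedThreadPool(1);

    // Simulates an RPC handler that makes an intra-service call.
    static String outerCall() throws Exception {
        Future<String> nested = pool.submit(() -> "inner-result");
        try {
            // Blocks the pool's only thread; the nested task is stuck queued.
            return nested.get(500, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "starved";
        }
    }

    public static void main(String[] args) throws Exception {
        Future<String> outer = pool.submit(StarvationDemo::outerCall);
        System.out.println(outer.get()); // prints "starved"
        pool.shutdownNow();
    }
}
```

> With two or more pool threads the nested task completes normally, which is 
> why the deadlock only appears once every pooled thread is occupied by a 
> handler that is itself waiting on a nested call.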
> In AMQ 5, it seemed that every MessageListener instance that was activated 
> created a new service thread, so this never happened.  I can see why Artemis 
> uses dynamic thread pools, because this is more resource-efficient.  However, 
> it seems to have a critical flaw.
> The attached project builds under maven 3.8 and Java 17.  When run, it 
> randomly decides if a service handler should call another service.  See the 
> Parms class for the controlling settings of the test:
> {code:java}
> public class Parms {
>     public static final int CLIENTS = 10;
>     public static final int SERVERS = 20;
>     public static final double RECURSION_PROBABILITY = 0.1;
>     public static final int ITERS = 10;
> }{code}
> These default settings tend to deadlock about 30% of the time.  Increasing 
> RECURSION_PROBABILITY to 0.5 results in deadlock nearly every time.  It is 
> notable that SERVERS (the number of {{MessageListener}} instances) is double 
> CLIENTS (the number of concurrent client-test threads), and yet the deadlock 
> still occurs, so it is pretty clear that we are not running out of 
> {{MessageListener}} instances.  In production, services never call 
> themselves; the call tree is strictly a DAG without cycles.  This test uses a 
> single service class for simplicity, but that makes no difference.
> I should also point out that I've tried to use globalThreadPools=false and 
> set up multiple connections, hoping each one would get its own thread pool.  
> This was a total failure.  Services were not delivered any messages at all, 
> no matter what variants I tried.  But that's another bug, see ARTEMIS-4105.
> At this point, our only choices appear to be abandoning our entire Artemis 
> migration effort, or refactoring our services into many processes so that 
> each one gets its own thread pool and, we hope, avoids this condition.  Any 
> help towards a workaround is appreciated.
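> One mitigation worth trying, assuming the shared global client pool is 
> indeed what is being exhausted, is to raise its ceiling before the first 
> connection factory is created.  The property name below comes from the 
> Artemis thread-management documentation; the value 200 is an arbitrary 
> example, not a recommendation, and this sketch is untested against the 
> attached project:

```java
// Possible workaround sketch (assumption: the global client thread pool is
// the exhausted resource). The system property must be set before any
// Artemis ConnectionFactory is created; the default maximum scales with
// the number of CPU cores.
public class PoolSizing {
    public static void main(String[] args) {
        // Leave headroom for handlers that block on nested RPC calls.
        System.setProperty(
            "activemq.artemis.client.global.thread.pool.max.size", "200");
        // ... create the JMS ConnectionFactory / connections here ...
    }
}
```

> This only postpones exhaustion rather than removing the hazard: a deep 
> enough chain of blocking intra-service calls can still consume any fixed 
> pool.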
> Finally, whether the RPC client waits on its reply-to queue via onMessage() 
> or via receive() can be switched in the test here:
> {code:java}
>     public static RpcClient make() {
> //        return new RpcClientReceive();
>         return new RpcClientOnMessage();
>     }{code}
> But it makes no difference.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
