[
https://issues.apache.org/jira/browse/ARTEMIS-4104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
john lilley updated ARTEMIS-4104:
---------------------------------
Description:
Our application uses AMQ/Artemis as a resilient intermediary to implement
RPC-style synchronous services. All of our API calls (other than the
connection factory) are through the default JMS client. We don't do anything
sophisticated. Dozens of services implement hundreds of methods. These services
can call each other; most notably, many services call the security service to
validate tokens, check user permissions, etc.
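For reference, our "RPC-style" calls are just the standard JMS request/reply
idiom. A rough sketch of the client side (this is not code from the attached
reproducer; the queue name and timeout are illustrative only):
{code:java}
import javax.jms.*;

// Minimal request/reply helper: send a request carrying a JMSReplyTo, then
// block until the reply arrives on a temporary queue (or the timeout expires).
public class RpcClientSketch {
    public static String call(Connection connection, String serviceQueue, String payload)
            throws JMSException {
        try (Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)) {
            Queue requestQueue = session.createQueue(serviceQueue);
            TemporaryQueue replyQueue = session.createTemporaryQueue();

            TextMessage request = session.createTextMessage(payload);
            request.setJMSReplyTo(replyQueue);

            try (MessageProducer producer = session.createProducer(requestQueue);
                 MessageConsumer consumer = session.createConsumer(replyQueue)) {
                producer.send(request);
                // The caller blocks here until the service replies or 30 s elapse.
                Message reply = consumer.receive(30_000);
                return reply == null ? null : ((TextMessage) reply).getText();
            }
        }
    }
}
{code}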
We've recently transitioned from AMQ 5 to Artemis, and started noticing
timeouts under load. Analysis showed that these timeouts were due to a certain
kind of deadlock, where all allocated AMQ client threads were servicing
requests and attempting to call other services. The Artemis client doesn't
allocate more threads under this condition, resulting in thread starvation and
deadlock waiting on responses. This can happen immediately if only a single
thread is allocated and it attempts to call another service.
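To illustrate the shape of the problem, here is a hypothetical handler (again,
not the reproducer code; it reuses the RpcClientSketch helper sketched above,
and the security queue name is made up). onMessage runs on an Artemis client
pool thread, and the nested blocking call needs another pool thread to deliver
the inner reply, so once every pool thread is busy inside onMessage, nothing
can make progress:
{code:java}
import javax.jms.*;

// Hypothetical service handler that makes a nested, blocking RPC to another
// service from inside onMessage.
public class NestedCallListener implements MessageListener {
    private final Connection connection;  // used for the nested outgoing call
    private final Session replySession;   // session used to send the outer reply

    public NestedCallListener(Connection connection, Session replySession) {
        this.connection = connection;
        this.replySession = replySession;
    }

    @Override
    public void onMessage(Message request) {
        try {
            // Nested RPC (e.g. a token/permission check against the security
            // service). This blocks the delivering pool thread until the inner
            // reply arrives; that never happens if no pool thread is free.
            String token = ((TextMessage) request).getText();
            String verdict = RpcClientSketch.call(connection, "security.service", token);

            // Send the outer reply back to the original caller.
            TextMessage reply = replySession.createTextMessage(verdict);
            try (MessageProducer producer = replySession.createProducer(request.getJMSReplyTo())) {
                producer.send(reply);
            }
        } catch (JMSException e) {
            throw new RuntimeException(e);
        }
    }
}
{code}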
In AMQ 5, it seemed that every MessageListener instance that was activated
created a new service thread, so this never happened. I can see why Artemis
uses dynamic thread pools, because this is more resource-efficient. However,
it seems to have a critical flaw.
The attached project builds under Maven 3.8 and Java 17. When run, it randomly
decides if a service handler should call another service. See the Parms class
for the controlling settings of the test:
{code:java}
public class Parms {
    public static final int CLIENTS = 10;
    public static final int SERVERS = 20;
    public static final double RECURSION_PROBABILITY = 0.1;
    public static final int ITERS = 10;
}
{code}
These default settings tend to deadlock about 30% of the time. Increasing
RECURSION_PROBABILITY to 0.5 results in deadlock nearly every time. It is
notable that SERVERS (the number of MessageListener instances) is double
CLIENTS (the number of concurrent client-test threads), and yet the deadlock
still occurs, so it is pretty clear that we are not running out of {_}service
instances{_}. In production, services never call themselves; the call graph is
strictly a DAG (no cycles). This test uses a single service class for
simplicity, but it makes no difference.
I should also point out that I've tried to use {{globalThreadPools=false}} and
set up multiple connections, hoping each one would get its own thread pool.
This was a total failure. Services were not delivered any messages at all, no
matter what variants I tried.
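For completeness, this is roughly how that experiment was wired up (the factory
setter is setUseGlobalPools; the broker URL and pool size below are
illustrative):
{code:java}
import javax.jms.Connection;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

// One connection factory (and hence one set of client thread pools) per group
// of services, instead of the shared global pools.
public class PerConnectionPools {
    public static Connection newConnectionWithOwnPools() throws Exception {
        ActiveMQConnectionFactory cf =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        cf.setUseGlobalPools(false);   // give this factory its own pools
        cf.setThreadPoolMaxSize(30);   // cap this factory's general-purpose pool
        Connection connection = cf.createConnection();
        connection.start();
        return connection;
    }
}
{code}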
At this point, our only options appear to be to abandon our entire Artemis
migration effort, or to refactor our services into many processes so that each
one gets its own thread pool and, hopefully, the problem does not occur. Any
help towards a workaround is appreciated.
> Services using RPC pattern can deadlock in intra-service calls due to thread
> starvation
> ----------------------------------------------------------------------------------------
>
> Key: ARTEMIS-4104
> URL: https://issues.apache.org/jira/browse/ARTEMIS-4104
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: ActiveMQ-Artemis-Native
> Affects Versions: 2.27.1
> Environment: {code:java}
> $ mvn --version
> Apache Maven 3.8.6 (84538c9988a25aec085021c365c560670ad80f63)
> Maven home: F:\bin\apache-maven
> Java version: 17.0.3, vendor: Eclipse Adoptium, runtime: C:\Program
> Files\Eclipse Adoptium\jdk-17.0.3.7-hotspot
> Default locale: en_US, platform encoding: Cp1252
> OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
> {code}
> Reporter: john lilley
> Assignee: Clebert Suconic
> Priority: Critical
> Attachments: amqtest.zip
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)