[jira] [Work logged] (ARTEMIS-3303) Default thread pool size is too generous

ASF GitHub Bot (Jira) Tue, 18 May 2021 02:40:34 -0700


     [ 
https://issues.apache.org/jira/browse/ARTEMIS-3303?focusedWorklogId=598503&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-598503
 ]


ASF GitHub Bot logged work on ARTEMIS-3303:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/May/21 09:39
            Start Date: 18/May/21 09:39
    Worklog Time Spent: 10m 
      Work Description: franz1981 edited a comment on pull request #3584:
URL: https://github.com/apache/activemq-artemis/pull/3584#issuecomment-843013239


   All fair points, and indeed I believe this should be a cautious and more 
conservative change. Still, there are historical motivations and experimental 
facts showing that what we set by default is no longer valid/useful, and that 
it was optimized neither for context switches nor for throughput/latency 
trade-offs.
   
   1. Re the Netty event loop sizing
   
   - historical facts: HornetQ and earlier versions of Artemis were blocking 
Netty threads, but that's no longer true. We can even choose to use 
`BlockHound` to enforce/check it on our CI, see 
https://github.com/netty/netty/pull/9687
   - experimental facts: generating a uniformly distributed load with Core 
clients (number of clients >= cores) showed that the default configuration of 
the Netty thread pool (3X the number of cores) prevents scaling and hurts both 
throughput and latencies. See 
https://github.com/apache/activemq-artemis/pull/3572#issuecomment-841788187 for 
some more details about it.
   
   The explanation for the experimental facts seems related to how the Netty 
event loop group works: 
   - Netty assigns client connections in round-robin fashion to the configured 
Netty threads
   - each client connection can issue write/read events that wake up the 
(single) selector of its event loop whenever there is work to do
   - if the number of Netty threads exceeds the number of cores and the number 
of clients is <= Netty threads, each time such a notification happens there is 
a good chance (2 out of 3 with the 3X default) that the thread that's going to 
handle it isn't on CPU (because threads outnumber cores), and the OS is forced 
to deschedule some (random) thread in order to run the Netty thread responsible 
for handling the event, causing unnecessary context switches.
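
   The 2/3 figure above can be reproduced with a back-of-the-envelope sketch. 
It does not use Netty at all: `EventLoopSizing`, `connectionsPerThread` and 
`offCpuFraction` are hypothetical names made up for illustration, and the model 
assumes (optimistically) that every core is available to event loop threads 
and that all of them are equally likely to be scheduled.

```java
import java.util.Arrays;

final class EventLoopSizing {

    // Round-robin assignment: connection c goes to event loop thread (c % threads).
    static int[] connectionsPerThread(int connections, int threads) {
        int[] counts = new int[threads];
        for (int c = 0; c < connections; c++) {
            counts[c % threads]++;
        }
        return counts;
    }

    // Simplified estimate (an assumption, not a measurement): at most `cores`
    // threads can run at once, so this is the share of threads that cannot be
    // on CPU at any given instant when all of them have work to do.
    static double offCpuFraction(int threads, int cores) {
        if (threads <= cores) {
            return 0.0;
        }
        return (threads - cores) / (double) threads;
    }

    public static void main(String[] args) {
        int cores = 12;
        // 30 clients spread over 3X threads: 30 threads get one connection, 6 get none.
        System.out.println(Arrays.toString(connectionsPerThread(30, 3 * cores)));
        System.out.println("3X cores: " + offCpuFraction(3 * cores, cores));
        System.out.println("2X cores: " + offCpuFraction(2 * cores, cores));
        System.out.println("1X cores: " + offCpuFraction(cores, cores));
    }
}
```

   With 3X the cores the estimate is 2/3, with 2X it is 1/2, and it drops to 0 
once the number of threads does not exceed the number of cores.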
   
   The Netty default is 2X the number of cores, meant for applications that 
rely heavily on event loop processing alone, but Artemis is not one of them: 
even AMQP uses I/O threads and needs GC, compiler threads and sometimes global 
threads to perform its job. Using 3X is just a waste of resources for the 
current Artemis version.
   
   2. Re the global thread pool sizing
   
   That's a bit more complex and depends on how `ActiveMQThreadPoolExecutor` 
works.
   Writing a simple program helps to spot what the problem with it is 
(very similar to the Netty one, but not the same).
   ```java
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentMap;
   import java.util.concurrent.ThreadPoolExecutor;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicLong;

   import org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor;
   import org.apache.activemq.artemis.utils.ExecutorFactory;
   import org.apache.activemq.artemis.utils.actors.ArtemisExecutor;
   import org.apache.activemq.artemis.utils.actors.OrderedExecutorFactory;
   import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;

   public class GlobalThreadPoolSpread {

      @SuppressWarnings("unchecked")
      public static void main(String[] args) throws InterruptedException {
         // Same bounds as the default global thread pool: [0, 30] threads
         ThreadPoolExecutor executor = new ActiveMQThreadPoolExecutor(0, 30, 60L, TimeUnit.SECONDS, r -> {
            Thread t = new Thread(r);
            System.err.println("created new thread: " + t);
            return t;
         });
         ExecutorFactory factory = new OrderedExecutorFactory(executor);
         final int clients = 30;
         int bursts = 100;
         // One ordered executor per simulated client, plus the set of threads that served it
         ConcurrentHashSet<Thread>[] executingThreads = new ConcurrentHashSet[clients];
         ArtemisExecutor[] artemisExecutor = new ArtemisExecutor[clients];
         for (int i = 0; i < clients; i++) {
            artemisExecutor[i] = factory.getExecutor();
            executingThreads[i] = new ConcurrentHashSet<>();
         }
         // Number of tasks executed by each pool thread
         ConcurrentMap<Thread, AtomicLong> executingT = new ConcurrentHashMap<>();
         for (int j = 0; j < bursts; j++) {
            for (int i = 0; i < clients; i++) {
               ConcurrentHashSet<Thread> threadsSeen = executingThreads[i];
               artemisExecutor[i].execute(() -> {
                  try {
                     // a small (~1 ms) task
                     TimeUnit.MILLISECONDS.sleep(1);
                  } catch (InterruptedException e) {
                     e.printStackTrace();
                  }
                  threadsSeen.add(Thread.currentThread());
                  executingT.computeIfAbsent(Thread.currentThread(), t -> new AtomicLong()).incrementAndGet();
               });
            }
            // Pause between bursts, simulating a GC pause
            System.out.println("GC pause");
            Thread.sleep(100);
         }
         for (int i = 0; i < clients; i++) {
            artemisExecutor[i].flush(60, TimeUnit.SECONDS);
         }
         executor.shutdown();
         executor.awaitTermination(70, TimeUnit.SECONDS);
         System.out.println("Executing threads: " + executingT);
         System.out.println("Workload distribution per artemis executor:");
         for (int i = 0; i < clients; i++) {
            System.out.println("[" + (i + 1) + "] - " + executingThreads[i].size());
         }
      }
   }
   ```
   On my machine (12 cores with HT - 6 real cores) it prints 30 times 
   ```created new thread: ...```
   and 
   ```
   Executing threads: 
   {Thread[Thread-1,5,]=103, 
   Thread[Thread-20,5,]=99, 
   Thread[Thread-17,5,]=99, 
   Thread[Thread-11,5,]=101, 
   Thread[Thread-18,5,]=99, 
   Thread[Thread-14,5,]=100, 
   Thread[Thread-13,5,]=100, 
   Thread[Thread-21,5,]=99, 
   Thread[Thread-24,5,]=98, 
   Thread[Thread-28,5,]=98, 
   Thread[Thread-5,5,]=103, 
   Thread[Thread-30,5,]=97, 
   Thread[Thread-27,5,]=97, 
   Thread[Thread-6,5,]=103, 
   Thread[Thread-4,5,]=102, 
   Thread[Thread-23,5,]=98, 
   Thread[Thread-25,5,]=98, 
   Thread[Thread-8,5,]=102, 
   Thread[Thread-7,5,]=102,
   Thread[Thread-3,5,]=103,
   Thread[Thread-9,5,]=101, 
   Thread[Thread-10,5,]=102, 
   Thread[Thread-19,5,]=99, 
   Thread[Thread-12,5,]=101, 
   Thread[Thread-15,5,]=100, 
   Thread[Thread-26,5,]=97, 
   Thread[Thread-29,5,]=97, 
   Thread[Thread-2,5,]=103,
   Thread[Thread-16,5,]=100, 
   Thread[Thread-22,5,]=99}
   Workload distribution per artemis executor:
   [1] - 17
   [2] - 18
   [3] - 17
   [4] - 15
   [5] - 13
   [6] - 13
   [7] - 17
   [8] - 17
   [9] - 18
   [10] - 14
   [11] - 13
   [12] - 17
   [13] - 15
   [14] - 14
   [15] - 17
   [16] - 18
   [17] - 16
   [18] - 12
   [19] - 17
   [20] - 16
   [21] - 16
   [22] - 14
   [23] - 17
   [24] - 17
   [25] - 17
   [26] - 21
   [27] - 20
   [28] - 19
   [29] - 22
   [30] - 18
   ```
   This gives some important information about how this thread pool works 
with bursts of small enough tasks (but not super small - ~1 ms), issued by 
several core clients (30 for this test) with some pauses (100 ms, in the same 
range as the G1 GC default pause target of 200 ms): 
   
   - the load is spread among all threads, i.e. each thread gets ~100 tasks
   - each executor (client) gets its tasks executed by different threads 
(12 to 22 of the 30 available)
   - the number of created threads depends on how busy the existing ones are
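
   The last point can be illustrated with a plain `java.util.concurrent` 
sketch. This is NOT `ActiveMQThreadPoolExecutor`: it is a stand-in built on a 
stock `ThreadPoolExecutor` with a `SynchronousQueue`, which also creates a 
thread only when no idle worker can take the task. Unlike the Artemis pool in 
the test above, this stdlib pool rejects work beyond its maximum, so the sketch 
keeps the number of tasks within it. `ThreadCreationDemo` and its methods are 
hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadCreationDemo {

    // Stock pool with [0, maxThreads] threads that counts every thread it creates.
    private static ThreadPoolExecutor newCountingPool(int maxThreads, AtomicInteger created) {
        return new ThreadPoolExecutor(0, maxThreads, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<>(), r -> {
                    created.incrementAndGet();
                    return new Thread(r);
                });
    }

    private static void requireWithinMax(int tasks, int maxThreads) {
        if (tasks > maxThreads) {
            throw new IllegalArgumentException("tasks must be <= maxThreads");
        }
    }

    // Every task stays busy until released, so no worker is ever idle and
    // each submission has to create a new thread.
    static int threadsCreatedWhenAllBusy(int tasks, int maxThreads) {
        requireWithinMax(tasks, maxThreads);
        AtomicInteger created = new AtomicInteger();
        ThreadPoolExecutor pool = newCountingPool(maxThreads, created);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        release.countDown();
        pool.shutdown();
        return created.get();
    }

    // Each task finishes before the next one is submitted, so an idle worker
    // is usually available. The exact count is timing dependent.
    static int threadsCreatedWhenMostlyIdle(int tasks, int maxThreads) {
        requireWithinMax(tasks, maxThreads);
        AtomicInteger created = new AtomicInteger();
        ThreadPoolExecutor pool = newCountingPool(maxThreads, created);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.submit(() -> { }).get();
                Thread.sleep(5); // give the worker time to go back to polling
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println("all busy:    " + threadsCreatedWhenAllBusy(30, 30));
        System.out.println("mostly idle: " + threadsCreatedWhenMostlyIdle(30, 30));
    }
}
```

   When all workers are busy the pool grows to one thread per task, while with 
idle workers around the same number of tasks typically needs far fewer threads.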
   
   In short, if the global thread executor is going to perform mostly 
non-blocking operations (NOTE: the I/O executor is responsible for blocking I/O 
ops), with enough clients (clients > available cores) we're going to use the 
whole number of threads configured on the pool. 
   This is ok, given that's what we expect by setting 30 as the max thread 
pool size.
   But if the max thread pool size exceeds the available cores we will end up, 
similarly to the Netty case, descheduling some threads at random, just to wake 
up the next one in charge of handling a specific task.
   In addition to this problem, there's another one related to the `Workload 
distribution`: having each client's tasks handled by different threads is 
ok, but it can cause many cache misses, because each new thread handling the 
workload starts without the task's execution context. Reusing the same thread 
again (in a more "sticky" way) makes CPU-bound computations go faster, as 
thread-per-core applications often advocate. 
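
   A minimal sketch of what "sticky" could look like, using only the JDK: each 
client is pinned to one single-threaded worker (client index modulo the number 
of workers), so its tasks always run on the same thread. This is only an 
illustration of the idea, not what Artemis does nor what the PR proposes; 
`StickyExecutors` and `distinctThreadsPerClient` are hypothetical names, and 
the number of workers would presumably be derived from the available cores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class StickyExecutors {

    // Runs `bursts` rounds of one task per client, pinning client i to worker
    // (i % workers), and returns how many distinct threads served each client.
    static int[] distinctThreadsPerClient(int clients, int workers, int bursts) {
        ExecutorService[] pool = new ExecutorService[workers];
        for (int w = 0; w < workers; w++) {
            pool[w] = Executors.newSingleThreadExecutor();
        }
        List<Set<Thread>> seen = new ArrayList<>();
        for (int i = 0; i < clients; i++) {
            seen.add(ConcurrentHashMap.newKeySet());
        }
        for (int j = 0; j < bursts; j++) {
            for (int i = 0; i < clients; i++) {
                Set<Thread> threadsSeen = seen.get(i);
                // Same client, same worker: the task always lands on one thread
                pool[i % workers].execute(() -> threadsSeen.add(Thread.currentThread()));
            }
        }
        for (ExecutorService worker : pool) {
            worker.shutdown();
        }
        try {
            for (ExecutorService worker : pool) {
                if (!worker.awaitTermination(60, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("worker did not terminate in time");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int[] result = new int[clients];
        for (int i = 0; i < clients; i++) {
            result[i] = seen.get(i).size();
        }
        return result;
    }

    public static void main(String[] args) {
        int workers = Runtime.getRuntime().availableProcessors();
        int[] perClient = distinctThreadsPerClient(30, workers, 100);
        for (int i = 0; i < perClient.length; i++) {
            System.out.println("[" + (i + 1) + "] - " + perClient[i]);
        }
    }
}
```

   Compared with the 12 to 22 threads per client observed in the test above, 
here every client is served by exactly one thread. The trade-off is that a 
long-running task delays every other client pinned to the same worker.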
   
   There are a few assumptions to be verified (what if an `ArtemisExecutor` 
keeps a specific thread busy for too long? is it true that global thread pool 
tasks cannot block? etc.) and more tests to be performed, but this shouldn't 
stop us from searching for a better adaptive (based on the machine spec) 
default IMO.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 598503)
    Time Spent: 3h 10m  (was: 3h)

> Default thread pool size is too generous
> ----------------------------------------
>
>                 Key: ARTEMIS-3303
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3303
>             Project: ActiveMQ Artemis
>          Issue Type: Improvement
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> By tweaking the thread pool sizes from their defaults it's possible to easily 
> gain twice the throughput: both the Netty (acceptor) and global thread pool 
> default sizes seem too generous relative to the available cores of a machine: 
> * 3 x cores for the former
> * [0, 30] for the latter (!!!)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
