[
https://issues.apache.org/jira/browse/ARTEMIS-3303?focusedWorklogId=598503&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-598503
]
ASF GitHub Bot logged work on ARTEMIS-3303:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 18/May/21 09:39
Start Date: 18/May/21 09:39
Worklog Time Spent: 10m
Work Description: franz1981 edited a comment on pull request #3584:
URL: https://github.com/apache/activemq-artemis/pull/3584#issuecomment-843013239
All fair points, and indeed I believe this should be a cautious and more
conservative change. Still, there are some historical motivations and
experimental facts showing that what we set by default is no longer
valid/useful, and that it was not optimized for context switches, nor for
throughput or latency trade-offs.
1. Re the Netty event loop sizing
- historical facts: HornetQ and earlier versions of Artemis were blocking
Netty threads, but that's no longer true. We can even choose to use
`BlockHound` to enforce/check it on our CI, see
https://github.com/netty/netty/pull/9687
- experimental facts: generating a uniformly distributed load with Core
clients, with clients >= cores, showed that the default configuration of the
Netty thread pool (3X the number of cores) prevents scaling and hurts both
throughput and latencies. See
https://github.com/apache/activemq-artemis/pull/3572#issuecomment-841788187 for
some more details about it.
The explanation for the experimental facts seems related to how the Netty
event loop group works:
- Netty assigns client connections in round-robin fashion to the configured
Netty threads
- each client connection can issue write/read events that wake up the event
loop's (single) selector whenever there is work to do
- if the number of Netty threads exceeds the number of cores and the number
of clients is <= the number of Netty threads, each time such a notification
happens there is some chance (2 in 3 with the 3X default) that the thread that
is going to handle it won't be on CPU (because the threads outnumber the
cores), and the OS is forced to deschedule some (random) thread in order to
run the Netty thread responsible for handling the interrupt, causing
unnecessary context switches.
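The reasoning above can be put into numbers with a minimal sketch. It is a
simplified model, not Netty code: it assumes that at most `cores` threads are
on CPU at any time and that every event loop is equally likely to be the one
that has to wake up. The class and method names are hypothetical.

```java
public class EventLoopOversubscription {

    /** Round-robin assignment: connection i is pinned to event loop (i mod eventLoops). */
    static int[] assignRoundRobin(int connections, int eventLoops) {
        int[] owner = new int[connections];
        for (int i = 0; i < connections; i++) {
            owner[i] = i % eventLoops;
        }
        return owner;
    }

    /**
     * Simplified model (an assumption, not a measurement): with more event
     * loop threads than cores, the fraction of threads that cannot be on CPU
     * at a given instant is (eventLoops - cores) / eventLoops.
     */
    static double offCpuFraction(int eventLoops, int cores) {
        if (eventLoops <= cores) {
            return 0.0;
        }
        return (double) (eventLoops - cores) / eventLoops;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int eventLoops = 3 * cores; // the 3X default discussed above
        int[] owner = assignRoundRobin(eventLoops, eventLoops);
        System.out.println("last connection is owned by event loop " + owner[owner.length - 1]);
        System.out.println("off-CPU fraction with 3X: " + offCpuFraction(eventLoops, cores));
        System.out.println("off-CPU fraction with 1X: " + offCpuFraction(cores, cores));
    }
}
```

With one connection per event loop and 3X threads, the model gives the 2/3
figure quoted above; sizing the group at or below the core count brings it to
zero.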
The Netty default is 2X the number of cores, meant for applications that rely
heavily on event loop processing alone, but Artemis is not one of them: even
AMQP uses I/O threads, and needs GC threads, compiler threads and sometimes
global threads to perform its job. Using 3X is just a waste of resources for
the current Artemis version.
2. Re the global thread pool sizing
That's a bit more complex and depends on how `ActiveMQThreadPoolExecutor`
works.
Writing a simple program helps to spot what the problem with it is
(very similar to the Netty one, but not the same).
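```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor;
import org.apache.activemq.artemis.utils.ExecutorFactory;
import org.apache.activemq.artemis.utils.actors.ArtemisExecutor;
import org.apache.activemq.artemis.utils.actors.OrderedExecutorFactory;
import org.apache.activemq.artemis.utils.collections.ConcurrentHashSet;

public class GlobalThreadPoolDistribution {

    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws InterruptedException {
        // Pool with 0 core threads, at most 30 threads and a 60 s keep-alive;
        // the factory logs every thread it creates.
        ThreadPoolExecutor executor = new ActiveMQThreadPoolExecutor(0, 30,
                60L, TimeUnit.SECONDS, new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r);
                System.err.println("created new thread: " + t);
                return t;
            }
        });
        ExecutorFactory factory = new OrderedExecutorFactory(executor);
        final int clients = 30;
        int bursts = 100;
        // One ordered executor per simulated client, plus the set of threads that ran its tasks.
        ConcurrentHashSet<Thread>[] executingThreads = new ConcurrentHashSet[clients];
        ArtemisExecutor[] artemisExecutor = new ArtemisExecutor[clients];
        for (int i = 0; i < clients; i++) {
            artemisExecutor[i] = factory.getExecutor();
            executingThreads[i] = new ConcurrentHashSet<>();
        }
        // Number of tasks executed by each pool thread.
        ConcurrentMap<Thread, AtomicLong> executingT = new ConcurrentHashMap<>();
        for (int j = 0; j < bursts; j++) {
            for (int i = 0; i < clients; i++) {
                ConcurrentHashSet<Thread> threadsSeen = executingThreads[i];
                artemisExecutor[i].execute(() -> {
                    try {
                        // ~1 ms of simulated work per task
                        TimeUnit.MILLISECONDS.sleep(1);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    threadsSeen.add(Thread.currentThread());
                    // Only the owning thread updates its own counter.
                    AtomicLong counter = executingT.get(Thread.currentThread());
                    if (counter == null) {
                        executingT.put(Thread.currentThread(), new AtomicLong(1));
                    } else {
                        counter.lazySet(counter.get() + 1);
                    }
                });
            }
            // Pause between bursts, standing in for a GC pause.
            System.out.println("GC pause");
            Thread.sleep(100);
        }
        for (int i = 0; i < clients; i++) {
            artemisExecutor[i].flush(60, TimeUnit.SECONDS);
        }
        executor.shutdown();
        executor.awaitTermination(70, TimeUnit.SECONDS);
        System.out.println("Executing threads: " + executingT);
        System.out.println("Workload distribution per artemis executor:");
        for (int i = 0; i < clients; i++) {
            System.out.println("[" + (i + 1) + "] - " + executingThreads[i].size());
        }
    }
}
```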
On my machine (12 cores with HT - 6 real cores) it prints 30 times
```created new thread: ...```
and
```
Executing threads:
{Thread[Thread-1,5,]=103,
Thread[Thread-20,5,]=99,
Thread[Thread-17,5,]=99,
Thread[Thread-11,5,]=101,
Thread[Thread-18,5,]=99,
Thread[Thread-14,5,]=100,
Thread[Thread-13,5,]=100,
Thread[Thread-21,5,]=99,
Thread[Thread-24,5,]=98,
Thread[Thread-28,5,]=98,
Thread[Thread-5,5,]=103,
Thread[Thread-30,5,]=97,
Thread[Thread-27,5,]=97,
Thread[Thread-6,5,]=103,
Thread[Thread-4,5,]=102,
Thread[Thread-23,5,]=98,
Thread[Thread-25,5,]=98,
Thread[Thread-8,5,]=102,
Thread[Thread-7,5,]=102,
Thread[Thread-3,5,]=103,
Thread[Thread-9,5,]=101,
Thread[Thread-10,5,]=102,
Thread[Thread-19,5,]=99,
Thread[Thread-12,5,]=101,
Thread[Thread-15,5,]=100,
Thread[Thread-26,5,]=97,
Thread[Thread-29,5,]=97,
Thread[Thread-2,5,]=103,
Thread[Thread-16,5,]=100,
Thread[Thread-22,5,]=99}
Workload distribution per artemis executor:
[1] - 17
[2] - 18
[3] - 17
[4] - 15
[5] - 13
[6] - 13
[7] - 17
[8] - 17
[9] - 18
[10] - 14
[11] - 13
[12] - 17
[13] - 15
[14] - 14
[15] - 17
[16] - 18
[17] - 16
[18] - 12
[19] - 17
[20] - 16
[21] - 16
[22] - 14
[23] - 17
[24] - 17
[25] - 17
[26] - 21
[27] - 20
[28] - 19
[29] - 22
[30] - 18
```
This gives some important info to understand how this thread pool works.
With bursts of small enough tasks (but not super small: ~1 ms), issued by
several core clients (30 for this test) with some pauses (100 ms, in the range
of a GC pause; the G1 GC default pause target is 200 ms):
- the load is spread among all threads, i.e. each thread gets ~100 tasks
- each executor (client) gets its tasks executed by different threads
(12 to 22 of the 30 available)
- the number of created threads depends on how busy the existing ones are
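The thread-hopping effect in the list above does not depend on Artemis
classes. Here is a minimal JDK-only sketch, assuming a fixed-size pool and a
serial executor written after the pattern shown in the
`java.util.concurrent.Executor` Javadoc. Both are stand-ins for
`ActiveMQThreadPoolExecutor` and `ArtemisExecutor`, not their implementation,
and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderedOverSharedPool {

    /** Runs its tasks one at a time, in submission order, on whichever pool thread is free. */
    static final class SerialExecutor implements Executor {
        private final Queue<Runnable> tasks = new ArrayDeque<>();
        private final Executor pool;
        private Runnable active;

        SerialExecutor(Executor pool) {
            this.pool = pool;
        }

        @Override
        public synchronized void execute(Runnable r) {
            tasks.add(() -> {
                try {
                    r.run();
                } finally {
                    scheduleNext();
                }
            });
            if (active == null) {
                scheduleNext();
            }
        }

        private synchronized void scheduleNext() {
            active = tasks.poll();
            if (active != null) {
                pool.execute(active);
            }
        }
    }

    /** Returns, for each client, how many distinct pool threads executed its tasks. */
    static int[] distinctThreadsPerClient(int clients, int bursts, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Set<Thread>> seen = new ArrayList<>();
            List<SerialExecutor> executors = new ArrayList<>();
            for (int i = 0; i < clients; i++) {
                executors.add(new SerialExecutor(pool));
                seen.add(ConcurrentHashMap.newKeySet());
            }
            CountDownLatch done = new CountDownLatch(clients * bursts);
            for (int j = 0; j < bursts; j++) {
                for (int i = 0; i < clients; i++) {
                    Set<Thread> threadsSeen = seen.get(i);
                    executors.get(i).execute(() -> {
                        threadsSeen.add(Thread.currentThread());
                        done.countDown();
                    });
                }
                Thread.sleep(1); // short pause between bursts
            }
            done.await();
            int[] result = new int[clients];
            for (int i = 0; i < clients; i++) {
                result[i] = seen.get(i).size();
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] counts = distinctThreadsPerClient(30, 100, 30);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("[" + (i + 1) + "] - " + counts[i]);
        }
    }
}
```

The exact counts vary from run to run and from machine to machine; the point
is only that ordering is preserved per client while the executing thread is
not.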
In short, if the global thread executor is going to perform mostly
non-blocking operations (NOTE: the I/O executor is responsible for blocking
I/O ops), with enough clients (clients > available cores) we're going to use
the whole number of threads configured on the pool.
This is OK, given that's what we expect by setting 30 as the max thread
pool size.
But if the max thread pool size exceeds the available cores we will end up,
similarly to the Netty case, descheduling some threads at random, just to wake
up the next one in charge of handling a specific task.
In addition to this problem, there's another one related to the
`Workload distribution`: having each client's tasks handled by different
threads is OK, but it can cause many cache misses, because each new thread
handling the workload doesn't have the task's execution context in its cache.
Reusing the same thread again (in a more "sticky" way) makes CPU-bound
computations go faster, as thread-per-core applications often advocate.
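A minimal JDK-only sketch of what a more "sticky" assignment could look like,
assuming each client is pinned to one of a fixed number of single-threaded
lanes. This is an illustration of the idea, not a proposal for how Artemis
should implement it, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StickyLanes {

    /** Returns, for each client, how many distinct threads executed its tasks. */
    static int[] distinctThreadsPerClient(int clients, int bursts, int lanes) {
        // One single-threaded executor per lane: tasks on a lane always run on the same thread.
        ExecutorService[] lane = new ExecutorService[lanes];
        for (int w = 0; w < lanes; w++) {
            lane[w] = Executors.newSingleThreadExecutor();
        }
        try {
            List<Set<Thread>> seen = new ArrayList<>();
            for (int i = 0; i < clients; i++) {
                seen.add(ConcurrentHashMap.newKeySet());
            }
            CountDownLatch done = new CountDownLatch(clients * bursts);
            for (int j = 0; j < bursts; j++) {
                for (int i = 0; i < clients; i++) {
                    Set<Thread> threadsSeen = seen.get(i);
                    // Sticky choice: client i always maps to lane (i mod lanes).
                    lane[i % lanes].execute(() -> {
                        threadsSeen.add(Thread.currentThread());
                        done.countDown();
                    });
                }
            }
            done.await();
            int[] result = new int[clients];
            for (int i = 0; i < clients; i++) {
                result[i] = seen.get(i).size();
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            for (ExecutorService e : lane) {
                e.shutdown();
            }
        }
    }

    public static void main(String[] args) {
        int lanes = Runtime.getRuntime().availableProcessors();
        int[] counts = distinctThreadsPerClient(30, 100, lanes);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("[" + (i + 1) + "] - " + counts[i]);
        }
    }
}
```

The trade-off is visible in the sketch: a slow or blocking task delays every
client sharing its lane, which is exactly the kind of assumption that still
needs to be verified.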
There are a few assumptions to be verified (what if an `ArtemisExecutor` keeps
a specific thread busy for too long? Can global thread pool tasks really never
block? etc.) and more tests to be performed, but this shouldn't stop us from
searching for better adaptive defaults (based on the machine spec) IMO.
Issue Time Tracking
-------------------
Worklog Id: (was: 598503)
Time Spent: 3h 10m (was: 3h)
> Default thread pool size is too generous
> ----------------------------------------
>
> Key: ARTEMIS-3303
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3303
> Project: ActiveMQ Artemis
> Issue Type: Improvement
> Reporter: Francesco Nigro
> Assignee: Francesco Nigro
> Priority: Major
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> By tweaking thread pool size from default it's possible to easily gain twice
> the throughput: both Netty (acceptor) and global thread pool default sizing
> seem too generous according to the available cores of a machine:
> * 3 x cores for the former
> * [0, 30] for the latter (!!!)