I'm sorry, I forgot to mention the machine I used for the load test. My server has 2 cores and 4 GB of RAM, and the JVM heap was set to 2880m. Under my test load (about 20,000 QPS), non-pooled virtual threads generate at least 20,000 × 8 KB ≈ 156 MB of byte[] allocations per second from that 8 KB buffer alone, not counting other object allocations. With a 2880 MB heap, this allocation rate already forces very frequent GC, and frequent GC raises CPU usage, which in turn significantly increases average response time and p99/p999 latency.
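As a minimal sketch of the allocation pattern just described (the class name, loop count, and buffer size are illustrative assumptions, not code from the library under discussion): each task runs on a brand-new virtual thread, so the ThreadLocal is never pre-populated and every task allocates a fresh array.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AllocationChurnSketch {
    // Mirrors the Aerospike-style per-thread buffer cache discussed below.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[8192]);

    public static void main(String[] args) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 20_000; i++) {
                executor.execute(() -> {
                    // A new virtual thread runs each task, so this ThreadLocal
                    // is always uninitialized here and a fresh 8 KB array is
                    // allocated: roughly 20,000 x 8 KB per second at the load
                    // described above.
                    byte[] buf = BUFFER.get();
                    // ... buf would be used for I/O here ...
                });
            }
        } // ExecutorService.close() waits for submitted tasks to finish
    }
}
```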
Pooling is usually introduced to solve performance problems: object pools and connection pools exist to reuse cached resources quickly and improve performance. So pooling virtual threads also yields clear benefits, especially for memory-constrained, I/O-bound applications (gateways, proxies, etc.) that are sensitive to latency.

Best Regards.
Jianbin Chen, github-id: funky-eyes

Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 22:20:

> I understand. I was trying to explain how you can avoid using thread
> locals and still maintain performance. It's unlikely that allocating an 8k
> buffer is a performance bottleneck in a real program if the task is not
> cpu bound (depending on the granularity of your tasks) - but 2M tasks
> running simultaneously would require 16 GB of memory, not including the
> stacks.
>
> You cannot simply use the thread-per-task model without an understanding
> of the cpu, IO, and memory footprints of your tasks, and then configure
> appropriately.
>
> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <[email protected]> wrote:
>
> I'm sorry, Robert; perhaps I didn't explain my example clearly enough.
> Here's the code in question:
>
> ```java
> Executor executor2 = new ThreadPoolExecutor(
>         200,
>         Integer.MAX_VALUE,
>         0L,
>         java.util.concurrent.TimeUnit.SECONDS,
>         new SynchronousQueue<>(),
>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
> );
> ```
>
> In this example, the pooled virtual threads don't implement any
> backpressure mechanism; they simply maintain a core pool of 200 virtual
> threads. Given that the queue is a `SynchronousQueue` and the maximum pool
> size is set to `Integer.MAX_VALUE`, once the concurrent tasks exceed 200,
> its behavior becomes identical to that of non-pooled virtual threads.
>
> From my perspective, this example demonstrates that the benefits of
> pooling virtual threads outweigh those of creating a new virtual thread
> for every single task. In IO-bound scenarios, the virtual threads are
> directly reused rather than being recreated each time, and the memory
> footprint of virtual threads is far smaller than that of platform threads
> (whose stack size is controlled by the `-Xss` flag). Additionally, with
> pooled virtual threads, the 8KB `byte[]` cache I mentioned earlier (stored
> in `ThreadLocal`) can also be reused, which further reduces overall memory
> usage; wouldn't you agree?
>
> Best Regards.
> Jianbin Chen, github-id: funky-eyes
>
> Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 21:52:
>
>> Because VTs are so efficient to create, without any backpressure they
>> will all be created and running at essentially the same time
>> (dramatically raising the amount of memory in use) - versus with a pool
>> of size N you will have at most N running at once. In a REAL WORLD
>> application there are often external limiters (like the number of tcp
>> connections) that provide a limit.
>>
>> If your tasks are purely cpu bound you should probably be using a capped
>> thread pool of platform threads, as it makes no sense to have more
>> threads than available cores.
>>
>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <[email protected]> wrote:
>>
>> The question is why I need to use a semaphore to control the number of
>> concurrently running tasks. In my particular example, the goal is simply
>> to keep the concurrency level the same across different thread pool
>> implementations so I can fairly compare which one completes all the
>> tasks faster.
>> This isn't solely about memory consumption; purely from a
>> **performance** perspective (e.g., total throughput or wall-clock time
>> to finish the workload), the same number of concurrent tasks completes
>> noticeably faster when using pooled virtual threads.
>>
>> My email probably didn't explain this clearly enough. In reality, I have
>> two main questions:
>>
>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g.,
>> to hold expensive reusable objects like connections, formatters, or
>> parsers), is switching to a **pooled virtual thread executor** the only
>> viable solution, assuming we cannot modify the third-party library code?
>>
>> 2. When running the exact same number of concurrent tasks, pooled
>> virtual threads deliver better performance.
>>
>> Both questions point toward the same conclusion: for an application
>> originally built around a traditional platform thread pool, after
>> upgrading to JDK 21/25, moving to a **pooled virtual threads** approach
>> is generally superior to simply using non-pooled (unbounded) virtual
>> threads.
>>
>> If any part of this reasoning or conclusion is mistaken, I would really
>> appreciate being corrected; thank you very much in advance for any
>> feedback or different experiences you can share!
>>
>> Best Regards.
>> Jianbin Chen, github-id: funky-eyes
>>
>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:58:
>>
>>> Exactly, this is your problem. The total number of tasks will all be
>>> running at once in the thread-per-task model.
>>>
>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>>>
>>> Hi Robert,
>>>
>>> Thank you, but I'm a bit confused. In the example above, I only set the
>>> core pool size to 200 virtual threads, but for the specific test case
>>> we're talking about, the concurrency isn't actually being limited by
>>> the pool size at all. Since the maximum thread count is
>>> Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed
>>> off immediately and a new thread gets created to run them right away
>>> anyway.
>>>
>>> Best Regards.
>>> Jianbin Chen, github-id: funky-eyes
>>>
>>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:28:
>>>
>>>> Try using a semaphore to limit the maximum number of tasks in progress
>>>> at any one time - that is what is causing your memory spike. Think of
>>>> it this way: since VT threads are so cheap to create, you are
>>>> essentially creating them all at once, making the working set size
>>>> equal to the maximum. So you have N * WSS, whereas in the other you
>>>> have POOLSIZE * WSS.
>>>>
>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>>>>
>>>> Hi Alan,
>>>>
>>>> Thanks for your reply and for mentioning JEP 444.
>>>> I've gone through the guidance in JEP 444 and have some understanding
>>>> of it, which is exactly why I'm feeling a bit puzzled in practice and
>>>> would really like to hear your thoughts.
>>>>
>>>> Background: ThreadLocal example (Aerospike)
>>>> ```java
>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>>>>     @Override
>>>>     protected byte[] initialValue() {
>>>>         return new byte[DefaultBufferSize];
>>>>     }
>>>> };
>>>> ```
>>>> This Aerospike code allocates a default 8KB byte[] the first time each
>>>> thread reads the ThreadLocal, caching one buffer per thread.
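For comparison with the ThreadLocal cache above, here is a minimal sketch of the kind of explicit, thread-independent buffer pool that the JEP 444 guidance points toward; the pool size, class name, and blocking take/release policy are assumptions for illustration, not Aerospike's actual API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A tiny shared buffer pool: buffers are reused across tasks no matter
// which (virtual) thread runs them, so ephemeral threads stop mattering.
public class BufferPool {
    private final BlockingQueue<byte[]> pool;

    public BufferPool(int size, int bufferSize) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(new byte[bufferSize]);
        }
    }

    public byte[] take() throws InterruptedException {
        return pool.take();   // blocks when all buffers are in use (backpressure)
    }

    public void release(byte[] buffer) {
        pool.offer(buffer);   // return the buffer for the next task
    }
}
```

A task would take() a buffer, use it, and release() it in a finally block; a pool size of, say, 200 then bounds both the buffer memory and the effective concurrency on the buffers, independently of how the threads are created.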
>>>>
>>>> My concern
>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>>> instances are effectively reused because threads are long-lived and
>>>> pooled.
>>>> - If we switch to creating a brand-new virtual thread per task (no
>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[],
>>>> which leads to many short-lived 8KB allocations.
>>>> - That raises the allocation rate and GC pressure (despite collectors
>>>> like ZGC), because ThreadLocal caches aren't reused when threads are
>>>> ephemeral.
>>>>
>>>> So my question is: for applications originally designed around
>>>> platform-thread pools, wouldn't partially pooling virtual threads be
>>>> beneficial? For example, Tomcat's default maximum thread count is 200;
>>>> if I keep a pool of 200 virtual threads, then when load exceeds that
>>>> core size, a SynchronousQueue will naturally cause new virtual threads
>>>> to be created on demand. This seems to preserve the behavior that
>>>> ThreadLocal-based libraries expect, without losing the ability to
>>>> expand under spikes. Since virtual threads are very lightweight,
>>>> pooling a reasonable number (e.g., 200) seems to have negligible
>>>> memory downside while retaining ThreadLocal cache effectiveness.
>>>>
>>>> Empirical test I ran
>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
>>>> executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
>>>>
>>>> ```java
>>>> public static void main(String[] args) throws InterruptedException {
>>>>     Executor executor = Executors.newThreadPerTaskExecutor(
>>>>             Thread.ofVirtual().name("test-", 1).factory());
>>>>     Executor executor2 = new ThreadPoolExecutor(
>>>>             200,
>>>>             Integer.MAX_VALUE,
>>>>             0L,
>>>>             java.util.concurrent.TimeUnit.SECONDS,
>>>>             new SynchronousQueue<>(),
>>>>             Thread.ofVirtual().name("test-threadpool-", 1).factory());
>>>>
>>>>     // Warm-up
>>>>     for (int i = 0; i < 10100; i++) {
>>>>         executor.execute(() -> {
>>>>             // simulate I/O wait
>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>         executor2.execute(() -> {
>>>>             // simulate I/O wait
>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>     }
>>>>
>>>>     // Ensure JIT + warm-up complete
>>>>     Thread.sleep(5000);
>>>>
>>>>     // Batch 1: unpooled per-task virtual threads
>>>>     long start = System.currentTimeMillis();
>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>>>     for (int i = 0; i < 50000; i++) {
>>>>         executor.execute(() -> {
>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>     }
>>>>     countDownLatch.await();
>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>>>>
>>>>     // Batch 2: pooled virtual threads
>>>>     start = System.currentTimeMillis();
>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>>>     for (int i = 0; i < 50000; i++) {
>>>>         executor2.execute(() -> {
>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>     }
>>>>     countDownLatch2.await(); // wait for the second batch, not the already-released first latch
>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>>>> }
>>>> ```
>>>>
>>>> Result summary
>>>> - In my runs, the pooled virtual-thread executor (executor2) performed
>>>> better than the unpooled per-task virtual-thread executor.
>>>> - Even when I increased the load by 10x or 100x, the pooled
>>>> virtual-thread executor still showed better performance.
>>>> - In realistic workloads, it seems pooling some virtual threads
>>>> reduces allocation/GC overhead and improves throughput compared to
>>>> strictly unpooled virtual threads.
>>>>
>>>> Final thought / request for feedback
>>>> - From my perspective, for systems originally tuned for
>>>> platform-thread pools, partially pooling virtual threads seems to have
>>>> no obvious downside and can restore the ThreadLocal cache
>>>> effectiveness that many third-party libraries rely on.
>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread
>>>> semantics, or ThreadLocal behavior, please point out what I'm missing.
>>>> I'd appreciate your guidance.
>>>>
>>>> Best Regards.
>>>> Jianbin Chen, github-id: funky-eyes
>>>>
>>>> Alan Bateman <[email protected]> wrote on Fri, Jan 23, 2026 at 17:27:
>>>>
>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>>>> > :
>>>>> >
>>>>> > So my question is:
>>>>> >
>>>>> > **In scenarios where third-party libraries heavily rely on
>>>>> ThreadLocal
>>>>> > for caching / buffering (and we cannot change those libraries to use
>>>>> > object pools instead), is explicitly pooling virtual threads (using
>>>>> a
>>>>> > ThreadPoolExecutor with a virtual thread factory) considered a
>>>>> > recommended / acceptable workaround?**
>>>>> >
>>>>> > Or are there better / more idiomatic ways to handle this kind of
>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>>>> > migrating to virtual threads?
>>>>> >
>>>>> > I have already opened a related discussion in the Dubbo project
>>>>> (since
>>>>> > Dubbo is one of the libraries affected in our stack):
>>>>> >
>>>>> > https://github.com/apache/dubbo/issues/16042
>>>>> >
>>>>> > Would love to hear your thoughts, especially from people who have
>>>>> > experience running large-scale virtual-thread-based services with
>>>>> > mixed third-party dependencies.
>>>>> >
>>>>>
>>>>> The guidance we put in JEP 444 [1] is to not pool virtual threads
>>>>> and to avoid caching costly resources in thread locals. Virtual
>>>>> threads support thread locals, of course, but that is not useful when
>>>>> a library is looking to share a costly resource between tasks that
>>>>> run on the same thread of a thread pool.
>>>>>
>>>>> I don't know anything about Aerospike, but working with the
>>>>> maintainers of that library to rework its buffer management seems
>>>>> like the right course of action here. Your mail says "byte buffers".
>>>>> If this is ByteBuffer, it might be that they are caching direct
>>>>> buffers, as those are expensive to create (and managed by the GC).
>>>>> Maybe they could look at using MemorySegment (it's easy to get a
>>>>> ByteBuffer view of a memory segment) and allocate from an arena that
>>>>> better matches the lifecycle.
>>>>>
>>>>> Hopefully others will share their experiences with migration, as it
>>>>> is indeed challenging to migrate code developed for thread pools to
>>>>> work efficiently on virtual threads, where there is a 1-1
>>>>> relationship between the task to execute and the thread.
>>>>>
>>>>> -Alan
>>>>>
>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
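To make the MemorySegment suggestion above concrete, here is a minimal sketch using the java.lang.foreign API (final in JDK 22, preview in 21); the 8 KB size and the per-task confined arena are assumptions about how such a library might scope its buffers.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

public class ArenaBufferSketch {
    public static void main(String[] args) {
        // One confined arena per task: the buffer's lifetime matches the
        // task rather than the (possibly ephemeral) thread.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(8192);
            ByteBuffer buffer = segment.asByteBuffer(); // ByteBuffer view of the segment
            buffer.put((byte) 1);                       // use the buffer as usual
        } // memory is freed deterministically when the arena closes
    }
}
```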
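And the semaphore approach suggested earlier in the thread, capping in-flight tasks instead of pooling threads, might look like this minimal sketch; the 200-permit cap, class name, and simulated 100 ms I/O wait are assumptions chosen to match the benchmark above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedVirtualThreads {
    private static final Semaphore PERMITS = new Semaphore(200); // assumed cap

    public static void main(String[] args) throws InterruptedException {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 50_000; i++) {
                PERMITS.acquire(); // blocks submission once 200 tasks are in flight
                executor.execute(() -> {
                    try {
                        Thread.sleep(100); // simulated I/O wait, as in the benchmark
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    } finally {
                        PERMITS.release(); // free a slot for the next task
                    }
                });
            }
        }
    }
}
```

This keeps the working set bounded at roughly 200 concurrent tasks while still creating a fresh, unpooled virtual thread per task.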
