I would say yes:
https://github.com/openjdk/jdk21/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/java/lang/ThreadBuilders.java#L317
unless the fix is backported - surely @Andrew Haley <[email protected]> or @Alan Bateman <[email protected]> knows.
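The pattern being flagged here is a factory that gives every virtual thread a counter-based name. A minimal sketch of the two factory variants follows; treating the unnamed factory as a way to sidestep JDK-8372410 until a fix lands is an assumption of this sketch, not something confirmed in the thread:

```java
import java.util.concurrent.ThreadFactory;

public class FactoryNaming {
    public static void main(String[] args) throws InterruptedException {
        // The pattern flagged in this thread: every thread created by this
        // factory gets a counter-based name ("worker-1", "worker-2", ...).
        ThreadFactory named = Thread.ofVirtual().name("worker-", 1).factory();

        // Possible workaround until a fix is backported (assumption: the
        // issue is tied to the per-thread naming counter, per the
        // ThreadBuilders.java line linked above): leave threads unnamed.
        ThreadFactory unnamed = Thread.ofVirtual().factory();

        Thread t1 = named.newThread(() -> System.out.println(Thread.currentThread()));
        Thread t2 = unnamed.newThread(() -> System.out.println(Thread.currentThread()));
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```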
On Fri, Jan 23, 2026 at 4:32 PM Jianbin Chen <[email protected]> wrote:

> Hi Francesco,
>
> I'd like to know if there's a similar issue in JDK 21?
>
> Best Regards.
> Jianbin Chen, github-id: funky-eyes
>
> On Fri, Jan 23, 2026 at 11:14 PM Francesco Nigro <[email protected]> wrote:
>
>> In the original code snippet I see named (with a counter) VThreads, so
>> be aware of https://bugs.openjdk.org/browse/JDK-8372410
>>
>> On Fri, Jan 23, 2026 at 3:52 PM Jianbin Chen <[email protected]> wrote:
>>
>>> I'm sorry — I forgot to mention the machine I used for the load test. My
>>> server has 2 cores and 4 GB RAM, and the JVM heap was set to 2880m. Under
>>> my test load (about 20,000 QPS), non-pooled virtual threads generate at
>>> least 20,000 × 8 KB ≈ 156 MB of byte[] allocations per second just from
>>> that 8 KB buffer, and that doesn't include other object allocations. With
>>> a 2880 MB heap this allocation rate already forces very frequent GC, and
>>> frequent GC raises CPU usage, which in turn significantly increases
>>> average response time and p99/p999 latency.
>>>
>>> Pooling is usually introduced to solve performance issues — object pools
>>> and connection pools exist to quickly reuse cached resources and improve
>>> performance. So pooling virtual threads also yields obvious benefits,
>>> especially for memory-constrained, I/O-bound applications (gateways,
>>> proxies, etc.) that are sensitive to latency.
>>>
>>> Best Regards.
>>> Jianbin Chen, github-id: funky-eyes
>>>
>>> On Fri, Jan 23, 2026 at 10:20 PM Robert Engels <[email protected]> wrote:
>>>
>>>> I understand. I was trying to explain how you can avoid thread locals
>>>> and still maintain the performance. It's unlikely that allocating an 8k
>>>> buffer is a performance bottleneck in a real program if the task is not
>>>> CPU bound (depending on the granularity of your tasks) - but 2M tasks
>>>> running simultaneously would require 16 GB of memory, not including the
>>>> stacks.
>>>>
>>>> You cannot simply use the thread-per-task model without an understanding
>>>> of the CPU, I/O, and memory footprints of your tasks, and then configure
>>>> appropriately.
>>>>
>>>> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <[email protected]> wrote:
>>>>
>>>> I'm sorry, Robert — perhaps I didn't explain my example clearly enough.
>>>> Here's the code in question:
>>>>
>>>> ```java
>>>> Executor executor2 = new ThreadPoolExecutor(
>>>>         200,
>>>>         Integer.MAX_VALUE,
>>>>         0L,
>>>>         java.util.concurrent.TimeUnit.SECONDS,
>>>>         new SynchronousQueue<>(),
>>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>>> );
>>>> ```
>>>>
>>>> In this example, the pooled virtual threads don't implement any
>>>> backpressure mechanism; they simply maintain a core pool of 200 virtual
>>>> threads. Given that the queue is a `SynchronousQueue` and the maximum
>>>> pool size is set to `Integer.MAX_VALUE`, once the concurrent tasks
>>>> exceed 200, its behavior becomes identical to that of non-pooled virtual
>>>> threads.
>>>>
>>>> From my perspective, this example demonstrates that the benefits of
>>>> pooling virtual threads outweigh those of creating a new virtual thread
>>>> for every single task. In I/O-bound scenarios, the virtual threads are
>>>> directly reused rather than being recreated each time, and the memory
>>>> footprint of virtual threads is far smaller than that of platform
>>>> threads (whose stack size is controlled by the `-Xss` flag).
>>>> Additionally, with pooled virtual threads, the 8KB `byte[]` cache I
>>>> mentioned earlier (stored in a `ThreadLocal`) can also be reused, which
>>>> further reduces overall memory usage — wouldn't you agree?
>>>>
>>>> Best Regards.
>>>> Jianbin Chen, github-id: funky-eyes
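Robert's earlier point about avoiding thread locals can be made concrete. Below is a minimal sketch of a shared buffer pool that keeps reusing buffers even when every task runs on a fresh virtual thread; the class name, the 8 KB size, and the unbounded queue are illustrative assumptions, not Aerospike's actual design:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical replacement for a ThreadLocal<byte[]> cache: buffers are
// pooled per application rather than per thread, so reuse still works in
// the thread-per-task model.
final class BufferPool {
    private static final int BUFFER_SIZE = 8 * 1024; // 8 KB, as discussed in this thread
    private final ConcurrentLinkedQueue<byte[]> pool = new ConcurrentLinkedQueue<>();

    byte[] acquire() {
        byte[] buf = pool.poll();          // reuse a returned buffer if one is available
        return (buf != null) ? buf : new byte[BUFFER_SIZE];
    }

    void release(byte[] buf) {
        pool.offer(buf);                   // unbounded here; a real pool would cap its size
    }
}
```

A task would call `acquire()` before its I/O call and `release(buf)` in a `finally` block; the pool then grows with the number of concurrently running tasks rather than with the number of threads ever created.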
>>>>
>>>> On Fri, Jan 23, 2026 at 9:52 PM Robert Engels <[email protected]> wrote:
>>>>
>>>>> Because VTs are so efficient to create, without any back pressure they
>>>>> will all be created and running at essentially the same time
>>>>> (dramatically raising the amount of memory in use) - whereas with a
>>>>> pool of size N you will have at most N running at once. In a REAL WORLD
>>>>> application there are often external limiters (like the number of TCP
>>>>> connections) that provide a limit.
>>>>>
>>>>> If your tasks are purely CPU bound you should probably be using a
>>>>> capped thread pool of platform threads, as it makes no sense to have
>>>>> more threads than available cores.
>>>>>
>>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <[email protected]> wrote:
>>>>>
>>>>> The question is why I need to use a semaphore to control the number of
>>>>> concurrently running tasks. In my particular example, the goal is
>>>>> simply to keep the concurrency level the same across different thread
>>>>> pool implementations so I can fairly compare which one completes all
>>>>> the tasks faster. This isn't solely about memory consumption — purely
>>>>> from a **performance** perspective (e.g., total throughput or
>>>>> wall-clock time to finish the workload), the same number of concurrent
>>>>> tasks completes noticeably faster when using pooled virtual threads.
>>>>>
>>>>> My email probably didn't explain this clearly enough. In reality, I
>>>>> have two main questions:
>>>>>
>>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g.,
>>>>> to hold expensive reusable objects like connections, formatters, or
>>>>> parsers), is switching to a **pooled virtual thread executor** the only
>>>>> viable solution — assuming we cannot modify the third-party library
>>>>> code?
>>>>>
>>>>> 2. When running the exact same number of concurrent tasks, pooled
>>>>> virtual threads deliver better performance.
>>>>>
>>>>> Both questions point toward the same conclusion: for an application
>>>>> originally built around a traditional platform thread pool, after
>>>>> upgrading to JDK 21/25, moving to a **pooled virtual threads** approach
>>>>> is generally superior to simply using non-pooled (unbounded) virtual
>>>>> threads.
>>>>>
>>>>> If any part of this reasoning or conclusion is mistaken, I would really
>>>>> appreciate being corrected — thank you very much in advance for any
>>>>> feedback or different experiences you can share!
>>>>>
>>>>> Best Regards.
>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>
>>>>> On Fri, Jan 23, 2026 at 8:58 PM robert engels <[email protected]> wrote:
>>>>>
>>>>>> Exactly, this is your problem. The total number of tasks will all be
>>>>>> running at once in the thread-per-task model.
>>>>>>
>>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>>>>>>
>>>>>> Hi Robert,
>>>>>>
>>>>>> Thank you, but I'm a bit confused. In the example above, I only set
>>>>>> the core pool size to 200 virtual threads, but for the specific test
>>>>>> case we're talking about, the concurrency isn't actually being limited
>>>>>> by the pool size at all. Since the maximum thread count is
>>>>>> Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed
>>>>>> off immediately and a new thread gets created to run them right away
>>>>>> anyway.
>>>>>>
>>>>>> Best Regards.
>>>>>> Jianbin Chen, github-id: funky-eyes
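Jianbin's reading of the executor's mechanics can be illustrated. A minimal sketch (an illustration, not the original benchmark) of the handoff behavior he describes: with a `SynchronousQueue` and an unbounded maximum pool size, `execute()` either hands the task to an idle worker waiting on the queue or starts a new thread immediately, so nothing is ever queued and the pool grows past its core size under load:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class HandoffDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                200, Integer.MAX_VALUE, 0L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                Thread.ofVirtual().factory());

        // 1,000 tasks that all sleep at once: no task can wait in the
        // SynchronousQueue, so the pool must grow well past 200 threads.
        for (int i = 0; i < 1_000; i++) {
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        Thread.sleep(50); // let the submissions land
        System.out.println("pool size: " + pool.getPoolSize()); // prints ~1000
        pool.shutdown();
    }
}
```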
>>>>>>
>>>>>> On Fri, Jan 23, 2026 at 8:28 PM robert engels <[email protected]> wrote:
>>>>>>
>>>>>>> Try using a semaphore to limit the maximum number of tasks in
>>>>>>> progress at any one time - that is what is causing your memory spike.
>>>>>>> Think of it this way: since VT threads are so cheap to create, you
>>>>>>> are essentially creating them all at once - making the working set
>>>>>>> size equal to the maximum. So you have N * WSS, whereas in the other
>>>>>>> case you have POOLSIZE * WSS.
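A minimal sketch of the semaphore-based backpressure Robert suggests; the limit of 200 mirrors the pool size used elsewhere in this thread, and acquiring in the submitter (rather than inside the task) is one of several possible placements:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedSubmit {
    public static void main(String[] args) throws InterruptedException {
        // At most 200 tasks in flight, so the working set is bounded by
        // 200 * WSS even though every task still gets a fresh virtual thread.
        Semaphore inFlight = new Semaphore(200);
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 50_000; i++) {
                inFlight.acquire(); // blocks the submitter once 200 tasks are running
                executor.execute(() -> {
                    try {
                        Thread.sleep(100); // simulate I/O wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.release();
                    }
                });
            }
        } // close() waits for the submitted tasks to complete
    }
}
```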
>>>>>>>
>>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Alan,
>>>>>>>
>>>>>>> Thanks for your reply and for mentioning JEP 444.
>>>>>>> I've gone through the guidance in JEP 444 and have some
>>>>>>> understanding of it — which is exactly why I'm feeling a bit puzzled
>>>>>>> in practice and would really like to hear your thoughts.
>>>>>>>
>>>>>>> Background — ThreadLocal example (Aerospike)
>>>>>>>
>>>>>>> ```java
>>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new
>>>>>>> ThreadLocal<byte[]>() {
>>>>>>>     @Override
>>>>>>>     protected byte[] initialValue() {
>>>>>>>         return new byte[DefaultBufferSize];
>>>>>>>     }
>>>>>>> };
>>>>>>> ```
>>>>>>>
>>>>>>> This Aerospike code allocates a default 8KB byte[] whenever a new
>>>>>>> thread is created and stores it in a ThreadLocal for per-thread
>>>>>>> caching.
>>>>>>>
>>>>>>> My concern:
>>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>>>>>> instances are effectively reused because threads are long-lived and
>>>>>>> pooled.
>>>>>>> - If we switch to creating a brand-new virtual thread per task (no
>>>>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[],
>>>>>>> which leads to many short-lived 8KB allocations.
>>>>>>> - That raises the allocation rate and GC pressure (despite collectors
>>>>>>> like ZGC), because ThreadLocal caches aren't reused when threads are
>>>>>>> ephemeral.
>>>>>>>
>>>>>>> So my question is: for applications originally designed around
>>>>>>> platform-thread pools, wouldn't partially pooling virtual threads be
>>>>>>> beneficial? For example, Tomcat's default max threads is 200 — if I
>>>>>>> keep a pool of 200 virtual threads, then when load exceeds that core
>>>>>>> size, a SynchronousQueue will naturally cause new virtual threads to
>>>>>>> be created on demand. This seems to preserve the behavior that
>>>>>>> ThreadLocal-based libraries expect, without losing the ability to
>>>>>>> expand under spikes. Since virtual threads are very lightweight,
>>>>>>> pooling a reasonable number (e.g., 200) seems to have negligible
>>>>>>> memory downside while retaining ThreadLocal cache effectiveness.
>>>>>>>
>>>>>>> Empirical test I ran
>>>>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
>>>>>>> executor and a ThreadPoolExecutor that keeps 200 core virtual
>>>>>>> threads.)
>>>>>>>
>>>>>>> ```java
>>>>>>> public static void main(String[] args) throws InterruptedException {
>>>>>>>     Executor executor = Executors.newThreadPerTaskExecutor(
>>>>>>>             Thread.ofVirtual().name("test-", 1).factory());
>>>>>>>     Executor executor2 = new ThreadPoolExecutor(
>>>>>>>             200,
>>>>>>>             Integer.MAX_VALUE,
>>>>>>>             0L,
>>>>>>>             java.util.concurrent.TimeUnit.SECONDS,
>>>>>>>             new SynchronousQueue<>(),
>>>>>>>             Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>>>>>>     );
>>>>>>>
>>>>>>>     // Warm-up
>>>>>>>     for (int i = 0; i < 10100; i++) {
>>>>>>>         executor.execute(() -> {
>>>>>>>             // simulate I/O wait
>>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>         executor2.execute(() -> {
>>>>>>>             // simulate I/O wait
>>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>     }
>>>>>>>
>>>>>>>     // Ensure JIT + warm-up complete
>>>>>>>     Thread.sleep(5000);
>>>>>>>
>>>>>>>     long start = System.currentTimeMillis();
>>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>>         executor.execute(() -> {
>>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
>>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>     }
>>>>>>>     countDownLatch.await();
>>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>>>
>>>>>>>     start = System.currentTimeMillis();
>>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>>         executor2.execute(() -> {
>>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
>>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>     }
>>>>>>>     countDownLatch2.await(); // wait for the pooled-executor batch
>>>>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>>> }
>>>>>>> ```
>>>>>>>
>>>>>>> Result summary
>>>>>>> - In my runs, the pooled virtual-thread executor (executor2)
>>>>>>> performed better than the unpooled per-task virtual-thread executor.
>>>>>>> - Even when I increased the load by 10x or 100x, the pooled
>>>>>>> virtual-thread executor still showed better performance.
>>>>>>> - In realistic workloads, it seems pooling some virtual threads
>>>>>>> reduces allocation/GC overhead and improves throughput compared to
>>>>>>> strictly unpooled virtual threads.
>>>>>>>
>>>>>>> Final thought / request for feedback
>>>>>>> - From my perspective, for systems originally tuned for
>>>>>>> platform-thread pools, partially pooling virtual threads seems to
>>>>>>> have no obvious downside and can restore the ThreadLocal cache
>>>>>>> effectiveness that many third-party libraries rely on.
>>>>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread
>>>>>>> semantics, or ThreadLocal behavior, please point out what I'm
>>>>>>> missing. I'd appreciate your guidance.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>>>
>>>>>>> On Fri, Jan 23, 2026 at 5:27 PM Alan Bateman <[email protected]> wrote:
>>>>>>>
>>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>>>>>>> > :
>>>>>>>> >
>>>>>>>> > So my question is:
>>>>>>>> >
>>>>>>>> > **In scenarios where third-party libraries heavily rely on
>>>>>>>> > ThreadLocal for caching / buffering (and we cannot change those
>>>>>>>> > libraries to use object pools instead), is explicitly pooling
>>>>>>>> > virtual threads (using a ThreadPoolExecutor with a virtual thread
>>>>>>>> > factory) considered a recommended / acceptable workaround?**
>>>>>>>> >
>>>>>>>> > Or are there better / more idiomatic ways to handle this kind of
>>>>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>>>>>>> > migrating to virtual threads?
>>>>>>>> >
>>>>>>>> > I have already opened a related discussion in the Dubbo project
>>>>>>>> > (since Dubbo is one of the libraries affected in our stack):
>>>>>>>> >
>>>>>>>> > https://github.com/apache/dubbo/issues/16042
>>>>>>>> >
>>>>>>>> > Would love to hear your thoughts — especially from people who have
>>>>>>>> > experience running large-scale virtual-thread-based services with
>>>>>>>> > mixed third-party dependencies.
>>>>>>>>
>>>>>>>> The guideline we put in JEP 444 [1] is to not pool virtual threads
>>>>>>>> and to avoid caching costly resources in thread locals. Virtual
>>>>>>>> threads support thread locals, of course, but that is not useful
>>>>>>>> when some library is looking to share a costly resource between
>>>>>>>> tasks that run on the same thread in a thread pool.
>>>>>>>>
>>>>>>>> I don't know anything about Aerospike, but working with the
>>>>>>>> maintainers of that library to re-work its buffer management seems
>>>>>>>> like the right course of action here. Your mail says "byte buffers".
>>>>>>>> If this is ByteBuffer, it might be that they are caching direct
>>>>>>>> buffers, as they are expensive to create (and managed by the GC).
>>>>>>>> Maybe they could look at using MemorySegment (it's easy to get a
>>>>>>>> ByteBuffer view of a memory segment) and allocate from an arena that
>>>>>>>> better matches the lifecycle.
>>>>>>>>
>>>>>>>> Hopefully others will share their experiences with migration, as it
>>>>>>>> is indeed challenging to migrate code developed for thread pools to
>>>>>>>> work efficiently on virtual threads where there is a 1-1
>>>>>>>> relationship between the task to execute and the thread.
>>>>>>>>
>>>>>>>> -Alan
>>>>>>>>
>>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
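A minimal sketch of the MemorySegment/Arena approach Alan outlines; the 8 KB size and the per-task confined arena are assumptions for illustration, not a statement of how Aerospike should do it:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

public class ArenaBuffer {
    public static void main(String[] args) {
        // Allocate from an arena whose lifetime matches the task/request,
        // instead of caching a buffer in a ThreadLocal.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(8 * 1024); // assumed 8 KB buffer
            ByteBuffer buffer = segment.asByteBuffer();       // ByteBuffer view of the segment
            buffer.putInt(42); // ... use the buffer for the I/O call ...
        } // the backing memory is freed deterministically when the arena closes
    }
}
```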
