I'm sorry, I forgot to mention the machine I used for the load test. My server has 2 cores and 4 GB of RAM, and the JVM heap was set to 2880m. Under my test load (about 20,000 QPS), non-pooled virtual threads generate at least 20,000 × 8 KB ≈ 156 MB of byte[] allocations per second from that 8 KB buffer alone, not counting other object allocations. With a 2880 MB heap, this allocation rate already forces very frequent GC, and frequent GC raises CPU usage, which in turn significantly increases average response time and p99/p999 latency.
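As a minimal sketch of the allocation pattern just described (the class name, loop count, and buffer size are illustrative assumptions, not code from the library under discussion): each task runs on a brand-new virtual thread, so the ThreadLocal is never pre-populated and every task allocates a fresh array.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AllocationChurnSketch {
    // Mirrors the Aerospike-style per-thread buffer cache discussed below.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[8192]);

    public static void main(String[] args) {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 20_000; i++) {
                executor.execute(() -> {
                    // A new virtual thread runs each task, so this ThreadLocal
                    // is always uninitialized here and a fresh 8 KB array is
                    // allocated: roughly 20,000 x 8 KB per second at the load
                    // described above.
                    byte[] buf = BUFFER.get();
                    // ... buf would be used for I/O here ...
                });
            }
        } // ExecutorService.close() waits for submitted tasks to finish
    }
}
```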
Pooling is usually introduced to solve performance problems: object pools and connection pools exist to reuse cached resources quickly and improve performance. So pooling virtual threads also yields clear benefits, especially for memory-constrained, I/O-bound applications (gateways, proxies, etc.) that are sensitive to latency.

Best Regards.
Jianbin Chen, github-id: funky-eyes

Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 22:20:

> I understand. I was trying to explain how you can avoid using thread
> locals and still maintain performance. It's unlikely that allocating an 8k
> buffer is a performance bottleneck in a real program if the task is not
> cpu bound (depending on the granularity of your tasks) - but 2M tasks
> running simultaneously would require 16 GB of memory, not including the
> stacks.
>
> You cannot simply use the thread-per-task model without an understanding
> of the cpu, IO, and memory footprints of your tasks, and then configure
> appropriately.
>
> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <[email protected]> wrote:
>
> I'm sorry, Robert; perhaps I didn't explain my example clearly enough.
> Here's the code in question:
>
> ```java
> Executor executor2 = new ThreadPoolExecutor(
>         200,
>         Integer.MAX_VALUE,
>         0L,
>         java.util.concurrent.TimeUnit.SECONDS,
>         new SynchronousQueue<>(),
>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
> );
> ```
>
> In this example, the pooled virtual threads don't implement any
> backpressure mechanism; they simply maintain a core pool of 200 virtual
> threads. Given that the queue is a `SynchronousQueue` and the maximum pool
> size is set to `Integer.MAX_VALUE`, once the concurrent tasks exceed 200,
> its behavior becomes identical to that of non-pooled virtual threads.
>
> From my perspective, this example demonstrates that the benefits of
> pooling virtual threads outweigh those of creating a new virtual thread
> for every single task. In IO-bound scenarios, the virtual threads are
> directly reused rather than being recreated each time, and the memory
> footprint of virtual threads is far smaller than that of platform threads
> (whose stack size is controlled by the `-Xss` flag). Additionally, with
> pooled virtual threads, the 8KB `byte[]` cache I mentioned earlier (stored
> in `ThreadLocal`) can also be reused, which further reduces overall memory
> usage; wouldn't you agree?
>
> Best Regards.
> Jianbin Chen, github-id: funky-eyes
>
> Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 21:52:
>
>> Because VTs are so efficient to create, without any backpressure they
>> will all be created and running at essentially the same time
>> (dramatically raising the amount of memory in use) - versus with a pool
>> of size N you will have at most N running at once. In a REAL WORLD
>> application there are often external limiters (like the number of tcp
>> connections) that provide a limit.
>>
>> If your tasks are purely cpu bound you should probably be using a capped
>> thread pool of platform threads, as it makes no sense to have more
>> threads than available cores.
>>
>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <[email protected]> wrote:
>>
>> The question is why I need to use a semaphore to control the number of
>> concurrently running tasks. In my particular example, the goal is simply
>> to keep the concurrency level the same across different thread pool
>> implementations so I can fairly compare which one completes all the
>> tasks faster.
>> This isn't solely about memory consumption; purely from a
>> **performance** perspective (e.g., total throughput or wall-clock time
>> to finish the workload), the same number of concurrent tasks completes
>> noticeably faster when using pooled virtual threads.
>>
>> My email probably didn't explain this clearly enough. In reality, I have
>> two main questions:
>>
>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g.,
>> to hold expensive reusable objects like connections, formatters, or
>> parsers), is switching to a **pooled virtual thread executor** the only
>> viable solution, assuming we cannot modify the third-party library code?
>>
>> 2. When running the exact same number of concurrent tasks, pooled
>> virtual threads deliver better performance.
>>
>> Both questions point toward the same conclusion: for an application
>> originally built around a traditional platform thread pool, after
>> upgrading to JDK 21/25, moving to a **pooled virtual threads** approach
>> is generally superior to simply using non-pooled (unbounded) virtual
>> threads.
>>
>> If any part of this reasoning or conclusion is mistaken, I would really
>> appreciate being corrected; thank you very much in advance for any
>> feedback or different experiences you can share!
>>
>> Best Regards.
>> Jianbin Chen, github-id: funky-eyes
>>
>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:58:
>>
>>> Exactly, this is your problem. The total number of tasks will all be
>>> running at once in the thread-per-task model.
>>>
>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>>>
>>> Hi Robert,
>>>
>>> Thank you, but I'm a bit confused. In the example above, I only set the
>>> core pool size to 200 virtual threads, but for the specific test case
>>> we're talking about, the concurrency isn't actually being limited by
>>> the pool size at all. Since the maximum thread count is
>>> Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed
>>> off immediately and a new thread gets created to run them right away
>>> anyway.
>>>
>>> Best Regards.
>>> Jianbin Chen, github-id: funky-eyes
>>>
>>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:28:
>>>
>>>> Try using a semaphore to limit the maximum number of tasks in progress
>>>> at any one time - that is what is causing your memory spike. Think of
>>>> it this way: since VT threads are so cheap to create, you are
>>>> essentially creating them all at once, making the working set size
>>>> equal to the maximum. So you have N * WSS, whereas in the other you
>>>> have POOLSIZE * WSS.
>>>>
>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>>>>
>>>> Hi Alan,
>>>>
>>>> Thanks for your reply and for mentioning JEP 444.
>>>> I've gone through the guidance in JEP 444 and have some understanding
>>>> of it, which is exactly why I'm feeling a bit puzzled in practice and
>>>> would really like to hear your thoughts.
>>>>
>>>> Background: ThreadLocal example (Aerospike)
>>>> ```java
>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>>>>     @Override
>>>>     protected byte[] initialValue() {
>>>>         return new byte[DefaultBufferSize];
>>>>     }
>>>> };
>>>> ```
>>>> This Aerospike code allocates a default 8KB byte[] the first time each
>>>> thread reads the ThreadLocal, caching one buffer per thread.
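For comparison with the ThreadLocal cache above, here is a minimal sketch of the kind of explicit, thread-independent buffer pool that the JEP 444 guidance points toward; the pool size, class name, and blocking take/release policy are assumptions for illustration, not Aerospike's actual API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A tiny shared buffer pool: buffers are reused across tasks no matter
// which (virtual) thread runs them, so ephemeral threads stop mattering.
public class BufferPool {
    private final BlockingQueue<byte[]> pool;

    public BufferPool(int size, int bufferSize) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(new byte[bufferSize]);
        }
    }

    public byte[] take() throws InterruptedException {
        return pool.take();   // blocks when all buffers are in use (backpressure)
    }

    public void release(byte[] buffer) {
        pool.offer(buffer);   // return the buffer for the next task
    }
}
```

A task would take() a buffer, use it, and release() it in a finally block; a pool size of, say, 200 then bounds both the buffer memory and the effective concurrency on the buffers, independently of how the threads are created.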
>>>>
>>>> My concern
>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>>> instances are effectively reused because threads are long-lived and
>>>> pooled.
>>>> - If we switch to creating a brand-new virtual thread per task (no
>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[],
>>>> which leads to many short-lived 8KB allocations.
>>>> - That raises the allocation rate and GC pressure (despite collectors
>>>> like ZGC), because ThreadLocal caches aren't reused when threads are
>>>> ephemeral.
>>>>
>>>> So my question is: for applications originally designed around
>>>> platform-thread pools, wouldn't partially pooling virtual threads be
>>>> beneficial? For example, Tomcat's default maximum thread count is 200;
>>>> if I keep a pool of 200 virtual threads, then when load exceeds that
>>>> core size, a SynchronousQueue will naturally cause new virtual threads
>>>> to be created on demand. This seems to preserve the behavior that
>>>> ThreadLocal-based libraries expect, without losing the ability to
>>>> expand under spikes. Since virtual threads are very lightweight,
>>>> pooling a reasonable number (e.g., 200) seems to have negligible
>>>> memory downside while retaining ThreadLocal cache effectiveness.
>>>>
>>>> Empirical test I ran
>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
>>>> executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
>>>>
>>>> ```java
>>>> public static void main(String[] args) throws InterruptedException {
>>>>     Executor executor = Executors.newThreadPerTaskExecutor(
>>>>             Thread.ofVirtual().name("test-", 1).factory());
>>>>     Executor executor2 = new ThreadPoolExecutor(
>>>>             200,
>>>>             Integer.MAX_VALUE,
>>>>             0L,
>>>>             java.util.concurrent.TimeUnit.SECONDS,
>>>>             new SynchronousQueue<>(),
>>>>             Thread.ofVirtual().name("test-threadpool-", 1).factory());
>>>>
>>>>     // Warm-up
>>>>     for (int i = 0; i < 10100; i++) {
>>>>         executor.execute(() -> {
>>>>             // simulate I/O wait
>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>         executor2.execute(() -> {
>>>>             // simulate I/O wait
>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>     }
>>>>
>>>>     // Ensure JIT + warm-up complete
>>>>     Thread.sleep(5000);
>>>>
>>>>     // Batch 1: unpooled per-task virtual threads
>>>>     long start = System.currentTimeMillis();
>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>>>     for (int i = 0; i < 50000; i++) {
>>>>         executor.execute(() -> {
>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>     }
>>>>     countDownLatch.await();
>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>>>>
>>>>     // Batch 2: pooled virtual threads
>>>>     start = System.currentTimeMillis();
>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>>>     for (int i = 0; i < 50000; i++) {
>>>>         executor2.execute(() -> {
>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>         });
>>>>     }
>>>>     countDownLatch2.await(); // wait for the second batch, not the already-released first latch
>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>>>> }
>>>> ```
>>>>
>>>> Result summary
>>>> - In my runs, the pooled virtual-thread executor (executor2) performed
>>>> better than the unpooled per-task virtual-thread executor.
>>>> - Even when I increased the load by 10x or 100x, the pooled
>>>> virtual-thread executor still showed better performance.
>>>> - In realistic workloads, it seems pooling some virtual threads
>>>> reduces allocation/GC overhead and improves throughput compared to
>>>> strictly unpooled virtual threads.
>>>>
>>>> Final thought / request for feedback
>>>> - From my perspective, for systems originally tuned for
>>>> platform-thread pools, partially pooling virtual threads seems to have
>>>> no obvious downside and can restore the ThreadLocal cache
>>>> effectiveness that many third-party libraries rely on.
>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread
>>>> semantics, or ThreadLocal behavior, please point out what I'm missing.
>>>> I'd appreciate your guidance.
>>>>
>>>> Best Regards.
>>>> Jianbin Chen, github-id: funky-eyes
>>>>
>>>> Alan Bateman <[email protected]> wrote on Fri, Jan 23, 2026 at 17:27:
>>>>
>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>>>> > :
>>>>> >
>>>>> > So my question is:
>>>>> >
>>>>> > **In scenarios where third-party libraries heavily rely on
>>>>> ThreadLocal
>>>>> > for caching / buffering (and we cannot change those libraries to use
>>>>> > object pools instead), is explicitly pooling virtual threads (using
>>>>> a
>>>>> > ThreadPoolExecutor with a virtual thread factory) considered a
>>>>> > recommended / acceptable workaround?**
>>>>> >
>>>>> > Or are there better / more idiomatic ways to handle this kind of
>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>>>> > migrating to virtual threads?
>>>>> >
>>>>> > I have already opened a related discussion in the Dubbo project
>>>>> (since
>>>>> > Dubbo is one of the libraries affected in our stack):
>>>>> >
>>>>> > https://github.com/apache/dubbo/issues/16042
>>>>> >
>>>>> > Would love to hear your thoughts, especially from people who have
>>>>> > experience running large-scale virtual-thread-based services with
>>>>> > mixed third-party dependencies.
>>>>> >
>>>>>
>>>>> The guidance we put in JEP 444 [1] is to not pool virtual threads
>>>>> and to avoid caching costly resources in thread locals. Virtual
>>>>> threads support thread locals, of course, but that is not useful when
>>>>> a library is looking to share a costly resource between tasks that
>>>>> run on the same thread of a thread pool.
>>>>>
>>>>> I don't know anything about Aerospike, but working with the
>>>>> maintainers of that library to rework its buffer management seems
>>>>> like the right course of action here. Your mail says "byte buffers".
>>>>> If this is ByteBuffer, it might be that they are caching direct
>>>>> buffers, as those are expensive to create (and managed by the GC).
>>>>> Maybe they could look at using MemorySegment (it's easy to get a
>>>>> ByteBuffer view of a memory segment) and allocate from an arena that
>>>>> better matches the lifecycle.
>>>>>
>>>>> Hopefully others will share their experiences with migration, as it
>>>>> is indeed challenging to migrate code developed for thread pools to
>>>>> work efficiently on virtual threads, where there is a 1-1
>>>>> relationship between the task to execute and the thread.
>>>>>
>>>>> -Alan
>>>>>
>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
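To make the MemorySegment suggestion above concrete, here is a minimal sketch using the java.lang.foreign API (final in JDK 22, preview in 21); the 8 KB size and the per-task confined arena are assumptions about how such a library might scope its buffers.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

public class ArenaBufferSketch {
    public static void main(String[] args) {
        // One confined arena per task: the buffer's lifetime matches the
        // task rather than the (possibly ephemeral) thread.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(8192);
            ByteBuffer buffer = segment.asByteBuffer(); // ByteBuffer view of the segment
            buffer.put((byte) 1);                       // use the buffer as usual
        } // memory is freed deterministically when the arena closes
    }
}
```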
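And the semaphore approach suggested earlier in the thread, capping in-flight tasks instead of pooling threads, might look like this minimal sketch; the 200-permit cap, class name, and simulated 100 ms I/O wait are assumptions chosen to match the benchmark above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedVirtualThreads {
    private static final Semaphore PERMITS = new Semaphore(200); // assumed cap

    public static void main(String[] args) throws InterruptedException {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 50_000; i++) {
                PERMITS.acquire(); // blocks submission once 200 tasks are in flight
                executor.execute(() -> {
                    try {
                        Thread.sleep(100); // simulated I/O wait, as in the benchmark
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    } finally {
                        PERMITS.release(); // free a slot for the next task
                    }
                });
            }
        }
    }
}
```

This keeps the working set bounded at roughly 200 concurrent tasks while still creating a fresh, unpooled virtual thread per task.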
