The question, then, is why I need a semaphore to control the number of concurrently running tasks. In my particular example, the goal is simply to keep the concurrency level identical across the different thread pool implementations so I can fairly compare which one completes all the tasks faster. This isn't only about memory consumption: purely from a **performance** perspective (e.g., total throughput, or wall-clock time to finish the workload), the same number of concurrent tasks completes noticeably faster when using pooled virtual threads.
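For reference, here is roughly how that cap can be applied to the per-task executor, a minimal sketch of the semaphore approach Robert suggested (the limit of 200 mirrors executor2's core pool size in the benchmark quoted below; the class name is mine):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedConcurrency {
    public static void main(String[] args) throws InterruptedException {
        Semaphore permits = new Semaphore(200); // same limit as the pooled executor

        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 50_000; i++) {
                permits.acquire(); // blocks once 200 tasks are in flight
                executor.execute(() -> {
                    try {
                        Thread.sleep(100); // simulated I/O wait
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    } finally {
                        permits.release(); // frees a slot for the next task
                    }
                });
            }
        } // close() waits for all submitted tasks to complete
    }
}
```

With this cap in place, both executors run at most 200 tasks at a time, so any remaining difference should come from thread creation and ThreadLocal churn rather than from raw concurrency.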
My email probably didn't explain this clearly enough. In reality, I have two main questions:

1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g., to hold expensive reusable objects like connections, formatters, or parsers), is switching to a **pooled virtual thread executor** the only viable solution, assuming we cannot modify the third-party library's code?
2. Why, when running the exact same number of concurrent tasks, do pooled virtual threads deliver better performance?

Both questions point toward the same conclusion: for an application originally built around a traditional platform thread pool, after upgrading to JDK 21/25, moving to a **pooled virtual threads** approach is generally superior to simply using non-pooled (unbounded) virtual threads. If any part of this reasoning or conclusion is mistaken, I would really appreciate being corrected. Thank you very much in advance for any feedback or different experiences you can share!

Best Regards.
Jianbin Chen, github-id: funky-eyes

robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:58:

> Exactly, this is your problem. The total number of tasks will all be
> running at once in the thread-per-task model.
>
> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>
> Hi Robert,
>
> Thank you, but I'm a bit confused. In the example above, I only set the
> core pool size to 200 virtual threads, but for the specific test case
> we're talking about, the concurrency isn't actually being limited by the
> pool size at all. Since the maximum thread count is Integer.MAX_VALUE and
> it's using a SynchronousQueue, tasks are handed off immediately and a new
> thread gets created to run them right away anyway.
>
> Best Regards.
> Jianbin Chen, github-id: funky-eyes
>
> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:28:
>
>> Try using a semaphore to limit the maximum number of tasks in progress
>> at any one time - that is what is causing your memory spike. Think of it
>> this way: since virtual threads are so cheap to create, you are
>> essentially creating them all at once, making the working set size equal
>> to the maximum. So you have N * WSS, whereas in the other case you have
>> POOLSIZE * WSS.
>>
>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>>
>> Hi Alan,
>>
>> Thanks for your reply and for mentioning JEP 444. I've gone through the
>> guidance in JEP 444 and have some understanding of it, which is exactly
>> why I'm feeling a bit puzzled in practice and would really like to hear
>> your thoughts.
>>
>> Background: a ThreadLocal example from Aerospike.
>>
>> ```java
>> // Aerospike's per-thread buffer cache: initialValue() runs the first
>> // time each thread reads the ThreadLocal, allocating a fresh buffer.
>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>>     @Override
>>     protected byte[] initialValue() {
>>         return new byte[DefaultBufferSize];
>>     }
>> };
>> ```
>>
>> This Aerospike code allocates a default 8KB byte[] for every new thread
>> that touches it and stores the array in a ThreadLocal for per-thread
>> caching.
>>
>> My concern:
>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>   instances are effectively reused because threads are long-lived and pooled.
>> - If we switch to creating a brand-new virtual thread per task (no
>>   pooling), each virtual thread gets its own fresh ThreadLocal byte[],
>>   which leads to many short-lived 8KB allocations (see the sketch below).
>> - That raises the allocation rate and GC pressure (even with collectors
>>   like ZGC), because ThreadLocal caches are never reused when threads are
>>   ephemeral.
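>>
>> A minimal sketch of that concern (my own illustration, not Aerospike
>> code; the 8KB size matches the default above): with a thread-per-task
>> executor, initialValue() runs once per task, so every task allocates a
>> fresh buffer and nothing is reused.
>>
>> ```java
>> import java.util.concurrent.Executors;
>> import java.util.concurrent.atomic.AtomicLong;
>>
>> public class ThreadLocalChurn {
>>     static final AtomicLong ALLOCATIONS = new AtomicLong();
>>
>>     // Stand-in for the Aerospike buffer cache quoted above.
>>     static final ThreadLocal<byte[]> BUFFER = ThreadLocal.withInitial(() -> {
>>         ALLOCATIONS.incrementAndGet();
>>         return new byte[8192];
>>     });
>>
>>     public static void main(String[] args) {
>>         try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
>>             for (int i = 0; i < 10_000; i++) {
>>                 executor.execute(() -> BUFFER.get()); // new thread => new byte[]
>>             }
>>         } // close() waits for all tasks to finish
>>         // Prints 10000: one 8KB allocation per task. With a pooled
>>         // executor of 200 long-lived threads it would print 200.
>>         System.out.println("buffers allocated: " + ALLOCATIONS.get());
>>     }
>> }
>> ```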
>>
>> So my question is: for applications originally designed around
>> platform-thread pools, wouldn't partially pooling virtual threads be
>> beneficial? For example, Tomcat's default maximum is 200 threads. If I
>> keep a pool of 200 virtual threads, then when load exceeds that core
>> size, the SynchronousQueue naturally causes new virtual threads to be
>> created on demand. This seems to preserve the behavior that
>> ThreadLocal-based libraries expect without losing the ability to expand
>> under spikes. Since virtual threads are very lightweight, pooling a
>> reasonable number (e.g., 200) seems to have negligible memory downside
>> while retaining ThreadLocal cache effectiveness.
>>
>> Empirical test: a microbenchmark comparing an unpooled per-task
>> virtual-thread executor (executor) against a ThreadPoolExecutor that
>> keeps 200 core virtual threads (executor2).
>>
>> ```java
>> import java.util.concurrent.*;
>>
>> public static void main(String[] args) throws InterruptedException {
>>     Executor executor = Executors.newThreadPerTaskExecutor(
>>             Thread.ofVirtual().name("test-", 1).factory());
>>     Executor executor2 = new ThreadPoolExecutor(
>>             200,                      // core pool size
>>             Integer.MAX_VALUE,        // max pool size
>>             0L,
>>             TimeUnit.SECONDS,
>>             new SynchronousQueue<>(), // direct handoff: no queuing
>>             Thread.ofVirtual().name("test-threadpool-", 1).factory());
>>
>>     // Warm-up
>>     for (int i = 0; i < 10100; i++) {
>>         executor.execute(() -> {
>>             // simulate I/O wait
>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>         });
>>         executor2.execute(() -> {
>>             // simulate I/O wait
>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>         });
>>     }
>>
>>     // Ensure JIT + warm-up complete
>>     Thread.sleep(5000);
>>
>>     long start = System.currentTimeMillis();
>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>     for (int i = 0; i < 50000; i++) {
>>         executor.execute(() -> {
>>             try { Thread.sleep(100); countDownLatch.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>         });
>>     }
>>     countDownLatch.await();
>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>>
>>     start = System.currentTimeMillis();
>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>     for (int i = 0; i < 50000; i++) {
>>         executor2.execute(() -> {
>>             try { Thread.sleep(100); countDownLatch2.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>         });
>>     }
>>     countDownLatch2.await();
>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>> }
>> ```
>>
>> Result summary:
>> - In my runs, the pooled virtual-thread executor (executor2) performed
>>   better than the unpooled per-task virtual-thread executor.
>> - Even when I increased the load by 10x or 100x, the pooled
>>   virtual-thread executor still showed better performance.
>> - In realistic workloads, pooling some virtual threads appears to reduce
>>   allocation/GC overhead and improve throughput compared to strictly
>>   unpooled virtual threads.
>>
>> Final thought / request for feedback:
>> - From my perspective, for systems originally tuned for platform-thread
>>   pools, partially pooling virtual threads seems to have no obvious
>>   downside and restores the ThreadLocal cache effectiveness that many
>>   third-party libraries depend on (a sketch of the library-side
>>   alternative follows below).
>> - If I've misunderstood the JEP 444 recommendations, virtual-thread
>>   semantics, or ThreadLocal behavior, please point out what I'm missing.
>>   I'd appreciate your guidance.
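>>
>> For comparison, the unpooled-threads alternative would be an explicit
>> buffer pool inside the library itself, something like this minimal
>> sketch (the BufferPool class and its acquire/release API are
>> hypothetical, not anything Aerospike exposes):
>>
>> ```java
>> import java.util.Queue;
>> import java.util.concurrent.ConcurrentLinkedQueue;
>>
>> // Hypothetical replacement for the ThreadLocal cache: buffers are shared
>> // across (virtual) threads, so reuse no longer depends on thread lifetime.
>> final class BufferPool {
>>     private static final Queue<byte[]> POOL = new ConcurrentLinkedQueue<>();
>>
>>     static byte[] acquire() {
>>         byte[] b = POOL.poll();
>>         return (b != null) ? b : new byte[8192]; // allocate only on a miss
>>     }
>>
>>     static void release(byte[] b) {
>>         POOL.offer(b); // make the buffer available to the next task
>>     }
>> }
>> ```
>>
>> But that change has to happen inside the library, which is exactly what
>> we cannot do here; hence the pooled-executor workaround.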
>>
>> Best Regards.
>> Jianbin Chen, github-id: funky-eyes
>>
>> Alan Bateman <[email protected]> wrote on Fri, Jan 23, 2026 at 17:27:
>>
>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>> > :
>>> >
>>> > So my question is:
>>> >
>>> > **In scenarios where third-party libraries heavily rely on ThreadLocal
>>> > for caching / buffering (and we cannot change those libraries to use
>>> > object pools instead), is explicitly pooling virtual threads (using a
>>> > ThreadPoolExecutor with a virtual thread factory) considered a
>>> > recommended / acceptable workaround?**
>>> >
>>> > Or are there better / more idiomatic ways to handle this kind of
>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>> > migrating to virtual threads?
>>> >
>>> > I have already opened a related discussion in the Dubbo project (since
>>> > Dubbo is one of the libraries affected in our stack):
>>> >
>>> > https://github.com/apache/dubbo/issues/16042
>>> >
>>> > Would love to hear your thoughts, especially from people who have
>>> > experience running large-scale virtual-thread-based services with
>>> > mixed third-party dependencies.
>>> >
>>>
>>> The guidance we put in JEP 444 [1] is to not pool virtual threads and
>>> to avoid caching costly resources in thread locals. Virtual threads
>>> support thread locals, of course, but that is not useful when a library
>>> is looking to share a costly resource between tasks that run on the
>>> same thread of a thread pool.
>>>
>>> I don't know anything about Aerospike, but working with the maintainers
>>> of that library to re-work its buffer management seems like the right
>>> course of action here. Your mail says "byte buffers". If this is
>>> ByteBuffer, it might be that they are caching direct buffers, as these
>>> are expensive to create (and managed by the GC). Maybe they could look
>>> at using MemorySegment (it's easy to get a ByteBuffer view of a memory
>>> segment) and allocate from an arena that better matches the lifecycle.
>>>
>>> Hopefully others will share their experiences with migration, as it is
>>> indeed challenging to migrate code developed for thread pools to work
>>> efficiently on virtual threads, where there is a 1-1 relationship
>>> between the task to execute and the thread.
>>>
>>> -Alan
>>>
>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
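>>
>> (On Alan's MemorySegment suggestion, my understanding of it in code
>> form, as a minimal sketch: the 8KB size and the confined arena are my
>> own assumptions, not anything Alan or Aerospike specified.)
>>
>> ```java
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import java.nio.ByteBuffer;
>>
>> public class ArenaBufferSketch {
>>     public static void main(String[] args) {
>>         // Tie the buffer's lifetime to an explicit scope rather than a thread.
>>         try (Arena arena = Arena.ofConfined()) {
>>             MemorySegment segment = arena.allocate(8192);
>>             ByteBuffer buffer = segment.asByteBuffer(); // ByteBuffer view of the segment
>>             buffer.putInt(42); // use it like any other ByteBuffer
>>         } // memory is freed deterministically when the arena closes
>>     }
>> }
>> ```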
