Hi Francesco, I'd like to know: is there a similar issue in JDK 21?
Best Regards.
Jianbin Chen, github-id: funky-eyes

On Jan 23, 2026, at 11:14 PM, Francesco Nigro <[email protected]> wrote:

> In the original code snippet I see named (with a counter) VThreads, so be
> aware of https://bugs.openjdk.org/browse/JDK-8372410
>
> On Jan 23, 2026, at 3:52 PM, Jianbin Chen <[email protected]> wrote:
>
>> I'm sorry — I forgot to mention the machine I used for the load test. My
>> server has 2 cores and 4 GB RAM, and the JVM heap was set to 2880m. Under
>> my test load (about 20,000 QPS), non-pooled virtual threads generate at
>> least 20,000 × 8 KB ≈ 156 MB of byte[] allocations per second from that
>> 8 KB buffer alone, not counting other object allocations. With a 2880 MB
>> heap this allocation rate already forces very frequent GC, and frequent
>> GC raises CPU usage, which in turn significantly increases average
>> response time and p99/p999 latency.
>>
>> Pooling is usually introduced to solve performance problems — object
>> pools and connection pools exist to reuse cached resources quickly and
>> improve performance. So pooling virtual threads also yields clear
>> benefits, especially for memory-constrained, I/O-bound applications
>> (gateways, proxies, etc.) that are sensitive to latency.
>>
>> Best Regards.
>> Jianbin Chen, github-id: funky-eyes
>>
>> On Jan 23, 2026, at 10:20 PM, Robert Engels <[email protected]> wrote:
>>
>>> I understand. I was trying to explain how you can avoid thread locals
>>> and still maintain the performance. It's unlikely that allocating an
>>> 8 KB buffer is a performance bottleneck in a real program if the task
>>> is not CPU bound (depending on the granularity of your tasks) — but 2M
>>> tasks running simultaneously would require 16 GB of memory, not
>>> including the stacks.
>>>
>>> You cannot simply use the thread-per-task model without an
>>> understanding of the CPU, I/O, and memory footprints of your tasks,
>>> and then configure appropriately.
>>>
>>> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <[email protected]> wrote:
>>>
>>> I'm sorry, Robert — perhaps I didn't explain my example clearly enough.
>>> Here's the code in question:
>>>
>>> ```java
>>> Executor executor2 = new ThreadPoolExecutor(
>>>         200,
>>>         Integer.MAX_VALUE,
>>>         0L,
>>>         java.util.concurrent.TimeUnit.SECONDS,
>>>         new SynchronousQueue<>(),
>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>> );
>>> ```
>>>
>>> In this example, the pooled virtual threads don't implement any
>>> backpressure mechanism; they simply maintain a core pool of 200 virtual
>>> threads. Given that the queue is a `SynchronousQueue` and the maximum
>>> pool size is set to `Integer.MAX_VALUE`, once the concurrent tasks
>>> exceed 200, its behavior becomes identical to that of non-pooled
>>> virtual threads.
>>>
>>> From my perspective, this example demonstrates that the benefits of
>>> pooling virtual threads outweigh those of creating a new virtual thread
>>> for every single task. In I/O-bound scenarios, the virtual threads are
>>> reused directly rather than recreated each time, and the memory
>>> footprint of virtual threads is far smaller than that of platform
>>> threads (whose stack size is controlled by the `-Xss` flag).
>>> Additionally, with pooled virtual threads, the 8 KB `byte[]` cache I
>>> mentioned earlier (stored in a `ThreadLocal`) can also be reused, which
>>> further reduces overall memory usage — wouldn't you agree?
>>>
>>> Best Regards.
>>> Jianbin Chen, github-id: funky-eyes
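
Robert's semaphore suggestion (quoted further down this thread) can be sketched minimally as follows. This is illustrative only: the class name, the 200-permit cap, and the 50,000-task count are assumptions echoing numbers from the thread, not code from it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Sketch: bound the number of in-flight tasks without pooling the threads.
public class BoundedVirtualThreads {
    public static void main(String[] args) throws InterruptedException {
        Semaphore permits = new Semaphore(200); // at most 200 tasks in flight
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 50_000; i++) {
                permits.acquire(); // blocks the submitter once the cap is reached
                executor.execute(() -> {
                    try {
                        Thread.sleep(100); // simulated I/O wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        permits.release();
                    }
                });
            }
        } // close() waits for all submitted tasks to complete (Java 19+)
    }
}
```

With this shape the working set stays near POOLSIZE * WSS (in Robert's terms below) while each task still gets a fresh, unpooled virtual thread.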
>>> On Jan 23, 2026, at 9:52 PM, Robert Engels <[email protected]> wrote:
>>>
>>>> Because VTs are so efficient to create, without any back pressure they
>>>> will all be created and running at essentially the same time
>>>> (dramatically raising the amount of memory in use) — versus with a
>>>> pool of size N you will have at most N running at once. In a REAL
>>>> WORLD application there are often external limiters (like the number
>>>> of TCP connections) that provide a limit.
>>>>
>>>> If your tasks are purely CPU bound you should probably be using a
>>>> capped thread pool of platform threads, as it makes no sense to have
>>>> more threads than available cores.
>>>>
>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <[email protected]> wrote:
>>>>
>>>> The question is why I need to use a semaphore to control the number of
>>>> concurrently running tasks. In my particular example, the goal is
>>>> simply to keep the concurrency level the same across different thread
>>>> pool implementations so I can fairly compare which one completes all
>>>> the tasks faster. This isn't solely about memory consumption — purely
>>>> from a **performance** perspective (e.g., total throughput or
>>>> wall-clock time to finish the workload), the same number of concurrent
>>>> tasks completes noticeably faster when using pooled virtual threads.
>>>>
>>>> My email probably didn't explain this clearly enough. In reality, I
>>>> have two main questions:
>>>>
>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool
>>>> (e.g., to hold expensive reusable objects like connections,
>>>> formatters, or parsers), is switching to a **pooled virtual thread
>>>> executor** the only viable solution — assuming we cannot modify the
>>>> third-party library code?
>>>>
>>>> 2. When running the exact same number of concurrent tasks, pooled
>>>> virtual threads deliver better performance.
>>>>
>>>> Both questions point toward the same conclusion: for an application
>>>> originally built around a traditional platform thread pool, after
>>>> upgrading to JDK 21/25, moving to a **pooled virtual threads**
>>>> approach is generally superior to simply using non-pooled (unbounded)
>>>> virtual threads.
>>>>
>>>> If any part of this reasoning or conclusion is mistaken, I would
>>>> really appreciate being corrected — thank you very much in advance for
>>>> any feedback or different experiences you can share!
>>>>
>>>> Best Regards.
>>>> Jianbin Chen, github-id: funky-eyes
>>>>
>>>> On Jan 23, 2026, at 8:58 PM, robert engels <[email protected]> wrote:
>>>>
>>>>> Exactly, this is your problem. The total number of tasks will all be
>>>>> running at once in the thread-per-task model.
>>>>>
>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>>>>>
>>>>> Hi Robert,
>>>>>
>>>>> Thank you, but I'm a bit confused. In the example above, I only set
>>>>> the core pool size to 200 virtual threads, but for the specific test
>>>>> case we're talking about, the concurrency isn't actually being
>>>>> limited by the pool size at all. Since the maximum thread count is
>>>>> Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed
>>>>> off immediately and a new thread gets created to run them right away
>>>>> anyway.
>>>>>
>>>>> Best Regards.
>>>>> Jianbin Chen, github-id: funky-eyes
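
For question 1 above, one alternative that needs neither thread pooling nor `ThreadLocal` is an explicit buffer pool shared across threads — roughly what a library (or application code that owns the buffers) could do instead. A minimal sketch; the class name, the 8 KB size, and the capacity are illustrative assumptions:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: a bounded pool of byte[] buffers shared across (virtual) threads,
// replacing per-thread ThreadLocal caching when threads are not reused.
final class BufferPool {
    private static final int BUFFER_SIZE = 8 * 1024;
    private final BlockingQueue<byte[]> pool;

    BufferPool(int capacity) {
        this.pool = new ArrayBlockingQueue<>(capacity);
    }

    byte[] acquire() {
        byte[] buf = pool.poll();           // reuse a cached buffer if available
        return buf != null ? buf : new byte[BUFFER_SIZE];
    }

    void release(byte[] buf) {
        pool.offer(buf);                    // if the pool is full, let GC take it
    }
}
```

Typical usage would be `byte[] buf = pool.acquire(); try { ... } finally { pool.release(buf); }`, which keeps buffer reuse independent of thread identity.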
>>>>> On Jan 23, 2026, at 8:28 PM, robert engels <[email protected]> wrote:
>>>>>
>>>>>> Try using a semaphore to limit the maximum number of tasks in
>>>>>> progress at any one time; that is what is causing your memory spike.
>>>>>> Think of it this way: since virtual threads are so cheap to create,
>>>>>> you are essentially creating them all at once, making the working
>>>>>> set size equal to the maximum. So you have N * WSS, whereas in the
>>>>>> pooled case you have POOLSIZE * WSS.
>>>>>>
>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>>>>>>
>>>>>> Hi Alan,
>>>>>>
>>>>>> Thanks for your reply and for mentioning JEP 444. I've gone through
>>>>>> the guidance in JEP 444 and have some understanding of it — which is
>>>>>> exactly why I'm feeling a bit puzzled in practice and would really
>>>>>> like to hear your thoughts.
>>>>>>
>>>>>> Background — ThreadLocal example (Aerospike):
>>>>>>
>>>>>> ```java
>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>>>>>>     @Override
>>>>>>     protected byte[] initialValue() {
>>>>>>         return new byte[DefaultBufferSize];
>>>>>>     }
>>>>>> };
>>>>>> ```
>>>>>>
>>>>>> This Aerospike code allocates a default 8 KB byte[] the first time
>>>>>> each thread reads the ThreadLocal and keeps it as a per-thread cache.
>>>>>>
>>>>>> My concern:
>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>>>>> instances are effectively reused because threads are long-lived and
>>>>>> pooled.
>>>>>> - If we switch to creating a brand-new virtual thread per task (no
>>>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[],
>>>>>> which leads to many short-lived 8 KB allocations.
>>>>>> - That raises the allocation rate and GC pressure (despite
>>>>>> collectors like ZGC), because ThreadLocal caches aren't reused when
>>>>>> threads are ephemeral.
>>>>>>
>>>>>> So my question is: for applications originally designed around
>>>>>> platform-thread pools, wouldn't partially pooling virtual threads be
>>>>>> beneficial? For example, Tomcat's default max threads is 200 — if I
>>>>>> keep a pool of 200 virtual threads, then when load exceeds that core
>>>>>> size, a SynchronousQueue will naturally cause new virtual threads to
>>>>>> be created on demand. This seems to preserve the behavior that
>>>>>> ThreadLocal-based libraries expect, without losing the ability to
>>>>>> expand under spikes. Since virtual threads are very lightweight,
>>>>>> pooling a reasonable number (e.g., 200) seems to have negligible
>>>>>> memory downside while retaining ThreadLocal cache effectiveness.
>>>>>>
>>>>>> Empirical test I ran (a microbenchmark comparing an unpooled
>>>>>> per-task virtual-thread executor against a ThreadPoolExecutor that
>>>>>> keeps 200 core virtual threads):
>>>>>>
>>>>>> ```java
>>>>>> public static void main(String[] args) throws InterruptedException {
>>>>>>     Executor executor = Executors.newThreadPerTaskExecutor(
>>>>>>             Thread.ofVirtual().name("test-", 1).factory());
>>>>>>     Executor executor2 = new ThreadPoolExecutor(
>>>>>>             200,
>>>>>>             Integer.MAX_VALUE,
>>>>>>             0L,
>>>>>>             java.util.concurrent.TimeUnit.SECONDS,
>>>>>>             new SynchronousQueue<>(),
>>>>>>             Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>>>>>     );
>>>>>>
>>>>>>     // Warm-up
>>>>>>     for (int i = 0; i < 10100; i++) {
>>>>>>         executor.execute(() -> {
>>>>>>             // simulate I/O wait
>>>>>>             try { Thread.sleep(100); }
>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>         executor2.execute(() -> {
>>>>>>             // simulate I/O wait
>>>>>>             try { Thread.sleep(100); }
>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>
>>>>>>     // Ensure JIT + warm-up complete
>>>>>>     Thread.sleep(5000);
>>>>>>
>>>>>>     long start = System.currentTimeMillis();
>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>         executor.execute(() -> {
>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>     countDownLatch.await();
>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>>
>>>>>>     start = System.currentTimeMillis();
>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>         executor2.execute(() -> {
>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>         });
>>>>>>     }
>>>>>>     countDownLatch2.await(); // wait for the executor2 batch to finish
>>>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>> }
>>>>>> ```
>>>>>>
>>>>>> Result summary:
>>>>>> - In my runs, the pooled virtual-thread executor (executor2)
>>>>>> performed better than the unpooled per-task virtual-thread executor.
>>>>>> - Even when I increased the load by 10x or 100x, the pooled
>>>>>> virtual-thread executor still showed better performance.
>>>>>> - In realistic workloads, it seems that pooling some virtual threads
>>>>>> reduces allocation/GC overhead and improves throughput compared to
>>>>>> strictly unpooled virtual threads.
>>>>>>
>>>>>> Final thought / request for feedback:
>>>>>> - From my perspective, for systems originally tuned for
>>>>>> platform-thread pools, partially pooling virtual threads seems to
>>>>>> have no obvious downside and can restore the ThreadLocal cache
>>>>>> effectiveness that many third-party libraries rely on.
>>>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread
>>>>>> semantics, or ThreadLocal behavior, please point out what I'm
>>>>>> missing. I'd appreciate your guidance.
>>>>>>
>>>>>> Best Regards.
>>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>> On Jan 23, 2026, at 5:27 PM, Alan Bateman <[email protected]> wrote:
>>>>>>
>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>>>>>> > :
>>>>>>> >
>>>>>>> > So my question is:
>>>>>>> >
>>>>>>> > **In scenarios where third-party libraries heavily rely on
>>>>>>> > ThreadLocal for caching / buffering (and we cannot change those
>>>>>>> > libraries to use object pools instead), is explicitly pooling
>>>>>>> > virtual threads (using a ThreadPoolExecutor with a virtual thread
>>>>>>> > factory) considered a recommended / acceptable workaround?**
>>>>>>> >
>>>>>>> > Or are there better / more idiomatic ways to handle this kind of
>>>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>>>>>> > migrating to virtual threads?
>>>>>>> >
>>>>>>> > I have already opened a related discussion in the Dubbo project
>>>>>>> > (since Dubbo is one of the libraries affected in our stack):
>>>>>>> >
>>>>>>> > https://github.com/apache/dubbo/issues/16042
>>>>>>> >
>>>>>>> > Would love to hear your thoughts — especially from people who
>>>>>>> > have experience running large-scale virtual-thread-based services
>>>>>>> > with mixed third-party dependencies.
>>>>>>>
>>>>>>> The guideline we put in JEP 444 [1] is to not pool virtual threads
>>>>>>> and to avoid caching costly resources in thread locals. Virtual
>>>>>>> threads support thread locals, of course, but that is not useful
>>>>>>> when some library is looking to share a costly resource between
>>>>>>> tasks that run on the same thread in a thread pool.
>>>>>>>
>>>>>>> I don't know anything about Aerospike, but working with the
>>>>>>> maintainers of that library to rework its buffer management seems
>>>>>>> like the right course of action here. Your mail says "byte
>>>>>>> buffers". If this is ByteBuffer, it might be that they are caching
>>>>>>> direct buffers, as they are expensive to create (and managed by the
>>>>>>> GC). Maybe they could look at using MemorySegment (it's easy to get
>>>>>>> a ByteBuffer view of a memory segment) and allocate from an arena
>>>>>>> that better matches the lifecycle.
>>>>>>>
>>>>>>> Hopefully others will share their experiences with migration, as it
>>>>>>> is indeed challenging to migrate code developed for thread pools to
>>>>>>> work efficiently on virtual threads, where there is a 1-1
>>>>>>> relationship between the task to execute and the thread.
>>>>>>>
>>>>>>> -Alan
>>>>>>>
>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
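
A minimal sketch of the MemorySegment/arena approach Alan mentions, assuming the FFM API (final in JDK 22, preview in JDK 21); the class name and the 8 KB size are illustrative:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

// Sketch: allocate a buffer from an arena scoped to the task and hand
// consumers a ByteBuffer view of it; the memory is freed deterministically
// when the arena closes, instead of lingering until a GC cycle.
public class ArenaBufferSketch {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(8 * 1024);
            ByteBuffer buffer = segment.asByteBuffer(); // zero-copy view
            buffer.putInt(42);
            System.out.println(buffer.getInt(0));       // prints 42
        } // native memory released here
    }
}
```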
