I would say yes:
https://github.com/openjdk/jdk21/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/java/lang/ThreadBuilders.java#L317
unless the fix is backported - surely @Andrew Haley <[email protected]> or @Alan Bateman <[email protected]> knows.
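The pattern being flagged here is a factory that gives every virtual thread a counter-based name. A minimal sketch of the two factory variants follows; treating the unnamed factory as a way to sidestep JDK-8372410 until a fix lands is an assumption of this sketch, not something confirmed in the thread:

```java
import java.util.concurrent.ThreadFactory;

public class FactoryNaming {
    public static void main(String[] args) throws InterruptedException {
        // The pattern flagged in this thread: every thread created by this
        // factory gets a counter-based name ("worker-1", "worker-2", ...).
        ThreadFactory named = Thread.ofVirtual().name("worker-", 1).factory();

        // Possible workaround until a fix is backported (assumption: the
        // issue is tied to the per-thread naming counter, per the
        // ThreadBuilders.java line linked above): leave threads unnamed.
        ThreadFactory unnamed = Thread.ofVirtual().factory();

        Thread t1 = named.newThread(() -> System.out.println(Thread.currentThread()));
        Thread t2 = unnamed.newThread(() -> System.out.println(Thread.currentThread()));
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```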
On Fri, Jan 23, 2026 at 4:32 PM Jianbin Chen <[email protected]> wrote:

> Hi Francesco,
>
> I'd like to know if there's a similar issue in JDK 21?
>
> Best Regards.
> Jianbin Chen, github-id: funky-eyes
>
> On Fri, Jan 23, 2026 at 11:14 PM Francesco Nigro <[email protected]> wrote:
>
>> In the original code snippet I see named (with a counter) VThreads, so
>> be aware of https://bugs.openjdk.org/browse/JDK-8372410
>>
>> On Fri, Jan 23, 2026 at 3:52 PM Jianbin Chen <[email protected]> wrote:
>>
>>> I'm sorry — I forgot to mention the machine I used for the load test. My
>>> server has 2 cores and 4 GB RAM, and the JVM heap was set to 2880m. Under
>>> my test load (about 20,000 QPS), non-pooled virtual threads generate at
>>> least 20,000 × 8 KB ≈ 156 MB of byte[] allocations per second just from
>>> that 8 KB buffer, and that doesn't include other object allocations. With
>>> a 2880 MB heap this allocation rate already forces very frequent GC, and
>>> frequent GC raises CPU usage, which in turn significantly increases
>>> average response time and p99/p999 latency.
>>>
>>> Pooling is usually introduced to solve performance issues — object pools
>>> and connection pools exist to quickly reuse cached resources and improve
>>> performance. So pooling virtual threads also yields obvious benefits,
>>> especially for memory-constrained, I/O-bound applications (gateways,
>>> proxies, etc.) that are sensitive to latency.
>>>
>>> Best Regards.
>>> Jianbin Chen, github-id: funky-eyes
>>>
>>> On Fri, Jan 23, 2026 at 10:20 PM Robert Engels <[email protected]> wrote:
>>>
>>>> I understand. I was trying to explain how you can avoid thread locals
>>>> and still maintain the performance. It's unlikely that allocating an 8k
>>>> buffer is a performance bottleneck in a real program if the task is not
>>>> CPU bound (depending on the granularity of your tasks) - but 2M tasks
>>>> running simultaneously would require 16 GB of memory, not including the
>>>> stacks.
>>>>
>>>> You cannot simply use the thread-per-task model without an understanding
>>>> of the CPU, I/O, and memory footprints of your tasks, and then configure
>>>> appropriately.
>>>>
>>>> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <[email protected]> wrote:
>>>>
>>>> I'm sorry, Robert — perhaps I didn't explain my example clearly enough.
>>>> Here's the code in question:
>>>>
>>>> ```java
>>>> Executor executor2 = new ThreadPoolExecutor(
>>>>         200,
>>>>         Integer.MAX_VALUE,
>>>>         0L,
>>>>         java.util.concurrent.TimeUnit.SECONDS,
>>>>         new SynchronousQueue<>(),
>>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>>> );
>>>> ```
>>>>
>>>> In this example, the pooled virtual threads don't implement any
>>>> backpressure mechanism; they simply maintain a core pool of 200 virtual
>>>> threads. Given that the queue is a `SynchronousQueue` and the maximum
>>>> pool size is set to `Integer.MAX_VALUE`, once the concurrent tasks
>>>> exceed 200, its behavior becomes identical to that of non-pooled virtual
>>>> threads.
>>>>
>>>> From my perspective, this example demonstrates that the benefits of
>>>> pooling virtual threads outweigh those of creating a new virtual thread
>>>> for every single task. In I/O-bound scenarios, the virtual threads are
>>>> directly reused rather than being recreated each time, and the memory
>>>> footprint of virtual threads is far smaller than that of platform
>>>> threads (whose stack size is controlled by the `-Xss` flag).
>>>> Additionally, with pooled virtual threads, the 8KB `byte[]` cache I
>>>> mentioned earlier (stored in a `ThreadLocal`) can also be reused, which
>>>> further reduces overall memory usage — wouldn't you agree?
>>>>
>>>> Best Regards.
>>>> Jianbin Chen, github-id: funky-eyes
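Robert's earlier point about avoiding thread locals can be made concrete. Below is a minimal sketch of a shared buffer pool that keeps reusing buffers even when every task runs on a fresh virtual thread; the class name, the 8 KB size, and the unbounded queue are illustrative assumptions, not Aerospike's actual design:

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical replacement for a ThreadLocal<byte[]> cache: buffers are
// pooled per application rather than per thread, so reuse still works in
// the thread-per-task model.
final class BufferPool {
    private static final int BUFFER_SIZE = 8 * 1024; // 8 KB, as discussed in this thread
    private final ConcurrentLinkedQueue<byte[]> pool = new ConcurrentLinkedQueue<>();

    byte[] acquire() {
        byte[] buf = pool.poll();          // reuse a returned buffer if one is available
        return (buf != null) ? buf : new byte[BUFFER_SIZE];
    }

    void release(byte[] buf) {
        pool.offer(buf);                   // unbounded here; a real pool would cap its size
    }
}
```

A task would call `acquire()` before its I/O call and `release(buf)` in a `finally` block; the pool then grows with the number of concurrently running tasks rather than with the number of threads ever created.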
>>>>
>>>> On Fri, Jan 23, 2026 at 9:52 PM Robert Engels <[email protected]> wrote:
>>>>
>>>>> Because VTs are so efficient to create, without any back pressure they
>>>>> will all be created and running at essentially the same time
>>>>> (dramatically raising the amount of memory in use) - whereas with a
>>>>> pool of size N you will have at most N running at once. In a REAL WORLD
>>>>> application there are often external limiters (like the number of TCP
>>>>> connections) that provide a limit.
>>>>>
>>>>> If your tasks are purely CPU bound you should probably be using a
>>>>> capped thread pool of platform threads, as it makes no sense to have
>>>>> more threads than available cores.
>>>>>
>>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <[email protected]> wrote:
>>>>>
>>>>> The question is why I need to use a semaphore to control the number of
>>>>> concurrently running tasks. In my particular example, the goal is
>>>>> simply to keep the concurrency level the same across different thread
>>>>> pool implementations so I can fairly compare which one completes all
>>>>> the tasks faster. This isn't solely about memory consumption — purely
>>>>> from a **performance** perspective (e.g., total throughput or
>>>>> wall-clock time to finish the workload), the same number of concurrent
>>>>> tasks completes noticeably faster when using pooled virtual threads.
>>>>>
>>>>> My email probably didn't explain this clearly enough. In reality, I
>>>>> have two main questions:
>>>>>
>>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g.,
>>>>> to hold expensive reusable objects like connections, formatters, or
>>>>> parsers), is switching to a **pooled virtual thread executor** the only
>>>>> viable solution — assuming we cannot modify the third-party library
>>>>> code?
>>>>>
>>>>> 2. When running the exact same number of concurrent tasks, pooled
>>>>> virtual threads deliver better performance.
>>>>>
>>>>> Both questions point toward the same conclusion: for an application
>>>>> originally built around a traditional platform thread pool, after
>>>>> upgrading to JDK 21/25, moving to a **pooled virtual threads** approach
>>>>> is generally superior to simply using non-pooled (unbounded) virtual
>>>>> threads.
>>>>>
>>>>> If any part of this reasoning or conclusion is mistaken, I would really
>>>>> appreciate being corrected — thank you very much in advance for any
>>>>> feedback or different experiences you can share!
>>>>>
>>>>> Best Regards.
>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>
>>>>> On Fri, Jan 23, 2026 at 8:58 PM robert engels <[email protected]> wrote:
>>>>>
>>>>>> Exactly, this is your problem. The total number of tasks will all be
>>>>>> running at once in the thread-per-task model.
>>>>>>
>>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>>>>>>
>>>>>> Hi Robert,
>>>>>>
>>>>>> Thank you, but I'm a bit confused. In the example above, I only set
>>>>>> the core pool size to 200 virtual threads, but for the specific test
>>>>>> case we're talking about, the concurrency isn't actually being limited
>>>>>> by the pool size at all. Since the maximum thread count is
>>>>>> Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed
>>>>>> off immediately and a new thread gets created to run them right away
>>>>>> anyway.
>>>>>>
>>>>>> Best Regards.
>>>>>> Jianbin Chen, github-id: funky-eyes
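Jianbin's reading of the executor's mechanics can be illustrated. A minimal sketch (an illustration, not the original benchmark) of the handoff behavior he describes: with a `SynchronousQueue` and an unbounded maximum pool size, `execute()` either hands the task to an idle worker waiting on the queue or starts a new thread immediately, so nothing is ever queued and the pool grows past its core size under load:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class HandoffDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                200, Integer.MAX_VALUE, 0L, TimeUnit.SECONDS,
                new SynchronousQueue<>(),
                Thread.ofVirtual().factory());

        // 1,000 tasks that all sleep at once: no task can wait in the
        // SynchronousQueue, so the pool must grow well past 200 threads.
        for (int i = 0; i < 1_000; i++) {
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        Thread.sleep(50); // let the submissions land
        System.out.println("pool size: " + pool.getPoolSize()); // prints ~1000
        pool.shutdown();
    }
}
```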
>>>>>>
>>>>>> On Fri, Jan 23, 2026 at 8:28 PM robert engels <[email protected]> wrote:
>>>>>>
>>>>>>> Try using a semaphore to limit the maximum number of tasks in
>>>>>>> progress at any one time - that is what is causing your memory spike.
>>>>>>> Think of it this way: since VT threads are so cheap to create, you
>>>>>>> are essentially creating them all at once - making the working set
>>>>>>> size equal to the maximum. So you have N * WSS, whereas in the other
>>>>>>> case you have POOLSIZE * WSS.
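A minimal sketch of the semaphore-based backpressure Robert suggests; the limit of 200 mirrors the pool size used elsewhere in this thread, and acquiring in the submitter (rather than inside the task) is one of several possible placements:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class BoundedSubmit {
    public static void main(String[] args) throws InterruptedException {
        // At most 200 tasks in flight, so the working set is bounded by
        // 200 * WSS even though every task still gets a fresh virtual thread.
        Semaphore inFlight = new Semaphore(200);
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 50_000; i++) {
                inFlight.acquire(); // blocks the submitter once 200 tasks are running
                executor.execute(() -> {
                    try {
                        Thread.sleep(100); // simulate I/O wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.release();
                    }
                });
            }
        } // close() waits for the submitted tasks to complete
    }
}
```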
>>>>>>>
>>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Alan,
>>>>>>>
>>>>>>> Thanks for your reply and for mentioning JEP 444.
>>>>>>> I've gone through the guidance in JEP 444 and have some
>>>>>>> understanding of it — which is exactly why I'm feeling a bit puzzled
>>>>>>> in practice and would really like to hear your thoughts.
>>>>>>>
>>>>>>> Background — ThreadLocal example (Aerospike)
>>>>>>>
>>>>>>> ```java
>>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new
>>>>>>> ThreadLocal<byte[]>() {
>>>>>>>     @Override
>>>>>>>     protected byte[] initialValue() {
>>>>>>>         return new byte[DefaultBufferSize];
>>>>>>>     }
>>>>>>> };
>>>>>>> ```
>>>>>>>
>>>>>>> This Aerospike code allocates a default 8KB byte[] whenever a new
>>>>>>> thread is created and stores it in a ThreadLocal for per-thread
>>>>>>> caching.
>>>>>>>
>>>>>>> My concern:
>>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[]
>>>>>>> instances are effectively reused because threads are long-lived and
>>>>>>> pooled.
>>>>>>> - If we switch to creating a brand-new virtual thread per task (no
>>>>>>> pooling), each virtual thread gets its own fresh ThreadLocal byte[],
>>>>>>> which leads to many short-lived 8KB allocations.
>>>>>>> - That raises the allocation rate and GC pressure (despite collectors
>>>>>>> like ZGC), because ThreadLocal caches aren't reused when threads are
>>>>>>> ephemeral.
>>>>>>>
>>>>>>> So my question is: for applications originally designed around
>>>>>>> platform-thread pools, wouldn't partially pooling virtual threads be
>>>>>>> beneficial? For example, Tomcat's default max threads is 200 — if I
>>>>>>> keep a pool of 200 virtual threads, then when load exceeds that core
>>>>>>> size, a SynchronousQueue will naturally cause new virtual threads to
>>>>>>> be created on demand. This seems to preserve the behavior that
>>>>>>> ThreadLocal-based libraries expect, without losing the ability to
>>>>>>> expand under spikes. Since virtual threads are very lightweight,
>>>>>>> pooling a reasonable number (e.g., 200) seems to have negligible
>>>>>>> memory downside while retaining ThreadLocal cache effectiveness.
>>>>>>>
>>>>>>> Empirical test I ran
>>>>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread
>>>>>>> executor and a ThreadPoolExecutor that keeps 200 core virtual
>>>>>>> threads.)
>>>>>>>
>>>>>>> ```java
>>>>>>> public static void main(String[] args) throws InterruptedException {
>>>>>>>     Executor executor = Executors.newThreadPerTaskExecutor(
>>>>>>>             Thread.ofVirtual().name("test-", 1).factory());
>>>>>>>     Executor executor2 = new ThreadPoolExecutor(
>>>>>>>             200,
>>>>>>>             Integer.MAX_VALUE,
>>>>>>>             0L,
>>>>>>>             java.util.concurrent.TimeUnit.SECONDS,
>>>>>>>             new SynchronousQueue<>(),
>>>>>>>             Thread.ofVirtual().name("test-threadpool-", 1).factory()
>>>>>>>     );
>>>>>>>
>>>>>>>     // Warm-up
>>>>>>>     for (int i = 0; i < 10100; i++) {
>>>>>>>         executor.execute(() -> {
>>>>>>>             // simulate I/O wait
>>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>         executor2.execute(() -> {
>>>>>>>             // simulate I/O wait
>>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>     }
>>>>>>>
>>>>>>>     // Ensure JIT + warm-up complete
>>>>>>>     Thread.sleep(5000);
>>>>>>>
>>>>>>>     long start = System.currentTimeMillis();
>>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>>         executor.execute(() -> {
>>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); }
>>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>     }
>>>>>>>     countDownLatch.await();
>>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>>>
>>>>>>>     start = System.currentTimeMillis();
>>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>>>>>>>     for (int i = 0; i < 50000; i++) {
>>>>>>>         executor2.execute(() -> {
>>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); }
>>>>>>>             catch (InterruptedException e) { throw new RuntimeException(e); }
>>>>>>>         });
>>>>>>>     }
>>>>>>>     countDownLatch2.await(); // wait for the pooled-executor batch
>>>>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>>>>>>> }
>>>>>>> ```
>>>>>>>
>>>>>>> Result summary
>>>>>>> - In my runs, the pooled virtual-thread executor (executor2)
>>>>>>> performed better than the unpooled per-task virtual-thread executor.
>>>>>>> - Even when I increased the load by 10x or 100x, the pooled
>>>>>>> virtual-thread executor still showed better performance.
>>>>>>> - In realistic workloads, it seems pooling some virtual threads
>>>>>>> reduces allocation/GC overhead and improves throughput compared to
>>>>>>> strictly unpooled virtual threads.
>>>>>>>
>>>>>>> Final thought / request for feedback
>>>>>>> - From my perspective, for systems originally tuned for
>>>>>>> platform-thread pools, partially pooling virtual threads seems to
>>>>>>> have no obvious downside and can restore the ThreadLocal cache
>>>>>>> effectiveness that many third-party libraries rely on.
>>>>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread
>>>>>>> semantics, or ThreadLocal behavior, please point out what I'm
>>>>>>> missing. I'd appreciate your guidance.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>> Jianbin Chen, github-id: funky-eyes
>>>>>>>
>>>>>>> On Fri, Jan 23, 2026 at 5:27 PM Alan Bateman <[email protected]> wrote:
>>>>>>>
>>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>>>>>>>> > :
>>>>>>>> >
>>>>>>>> > So my question is:
>>>>>>>> >
>>>>>>>> > **In scenarios where third-party libraries heavily rely on
>>>>>>>> > ThreadLocal for caching / buffering (and we cannot change those
>>>>>>>> > libraries to use object pools instead), is explicitly pooling
>>>>>>>> > virtual threads (using a ThreadPoolExecutor with a virtual thread
>>>>>>>> > factory) considered a recommended / acceptable workaround?**
>>>>>>>> >
>>>>>>>> > Or are there better / more idiomatic ways to handle this kind of
>>>>>>>> > compatibility issue with legacy ThreadLocal-based libraries when
>>>>>>>> > migrating to virtual threads?
>>>>>>>> >
>>>>>>>> > I have already opened a related discussion in the Dubbo project
>>>>>>>> > (since Dubbo is one of the libraries affected in our stack):
>>>>>>>> >
>>>>>>>> > https://github.com/apache/dubbo/issues/16042
>>>>>>>> >
>>>>>>>> > Would love to hear your thoughts — especially from people who have
>>>>>>>> > experience running large-scale virtual-thread-based services with
>>>>>>>> > mixed third-party dependencies.
>>>>>>>>
>>>>>>>> The guideline we put in JEP 444 [1] is to not pool virtual threads
>>>>>>>> and to avoid caching costly resources in thread locals. Virtual
>>>>>>>> threads support thread locals, of course, but that is not useful
>>>>>>>> when some library is looking to share a costly resource between
>>>>>>>> tasks that run on the same thread in a thread pool.
>>>>>>>>
>>>>>>>> I don't know anything about Aerospike, but working with the
>>>>>>>> maintainers of that library to re-work its buffer management seems
>>>>>>>> like the right course of action here. Your mail says "byte buffers".
>>>>>>>> If this is ByteBuffer, it might be that they are caching direct
>>>>>>>> buffers, as they are expensive to create (and managed by the GC).
>>>>>>>> Maybe they could look at using MemorySegment (it's easy to get a
>>>>>>>> ByteBuffer view of a memory segment) and allocate from an arena that
>>>>>>>> better matches the lifecycle.
>>>>>>>>
>>>>>>>> Hopefully others will share their experiences with migration, as it
>>>>>>>> is indeed challenging to migrate code developed for thread pools to
>>>>>>>> work efficiently on virtual threads where there is a 1-1
>>>>>>>> relationship between the task to execute and the thread.
>>>>>>>>
>>>>>>>> -Alan
>>>>>>>>
>>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
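A minimal sketch of the MemorySegment/Arena approach Alan outlines; the 8 KB size and the per-task confined arena are assumptions for illustration, not a statement of how Aerospike should do it:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

public class ArenaBuffer {
    public static void main(String[] args) {
        // Allocate from an arena whose lifetime matches the task/request,
        // instead of caching a buffer in a ThreadLocal.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(8 * 1024); // assumed 8 KB buffer
            ByteBuffer buffer = segment.asByteBuffer();       // ByteBuffer view of the segment
            buffer.putInt(42); // ... use the buffer for the I/O call ...
        } // the backing memory is freed deterministically when the arena closes
    }
}
```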
