## Summary

This PR proposes to introduce a pooled confined arena as an optimization for `Arena.ofConfined()`, where small native allocations can be served from a reusable per-thread/per-slot memory pool instead of calling the regular native allocator for every short-lived arena. The arena remains confined to its owner thread and is still closed normally, but its backing storage can be reset and reused when the arena closes. The feature requires no API changes.
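
To make the acquire/allocate/reset cycle described above concrete, here is a minimal sketch, not the PR's implementation. It assumes a heap `byte[]` as a stand-in for the native pool, a bump-pointer allocator inside the pool, and lazy acquisition on the first allocation that fits. The names `ScratchPoolSketch`, `Scope`, `open`, `allocate`, `put`, and `get` are hypothetical and do not appear in the PR.

```java
// Hypothetical sketch: a heap byte[] stands in for the per-thread native pool.
// None of these names come from the PR; the logic is an illustration only.
final class ScratchPoolSketch {
    static final int POOL_SIZE = 64; // default pool size stated in the PR description

    private final byte[] pool = new byte[POOL_SIZE]; // stand-in for pooled native memory
    private boolean acquired;                        // true while one scope holds the pool

    /** Opens a scope; the pool is only acquired lazily, on the first allocation that fits. */
    Scope open() {
        return new Scope();
    }

    final class Scope implements AutoCloseable {
        private boolean holdsPool; // whether this scope currently owns the pool
        private boolean closed;
        private int used;          // bump pointer into the pool (an assumption of this sketch)

        /** Returns an offset into the pool, or -1 if the caller must use the regular allocator. */
        int allocate(int size) {
            checkOpen();
            if (size < 0) throw new IllegalArgumentException("negative size");
            if (size > POOL_SIZE - used) return -1; // too large: fall back
            if (!holdsPool) {
                if (acquired) return -1;            // another (nested) scope holds the pool
                acquired = true;
                holdsPool = true;
            }
            int offset = used;
            used += size;
            return offset;
        }

        void put(int offset, byte value) { checkOpen(); pool[offset] = value; }

        byte get(int offset) { checkOpen(); return pool[offset]; }

        /** Zeroes the bytes that were handed out, then releases the pool for reuse. */
        @Override
        public void close() {
            if (closed) return;
            closed = true;
            if (holdsPool) {
                java.util.Arrays.fill(pool, 0, used, (byte) 0);
                holdsPool = false;
                acquired = false;
            }
        }

        private void checkOpen() {
            if (closed) throw new IllegalStateException("scope is closed");
        }
    }
}
```

In this sketch, a scope used in a try-with-resources block gets pooled storage for small requests, receives `-1` for requests that do not fit (or when another scope already holds the pool), and leaves the pool zeroed for the next scope when it closes.
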
### Outline

- Platform threads: one lazily allocated pool per Thread, encoded in `Thread.confinedMemoryPool`.
- Virtual threads: fixed shared native pool with CAS-protected slots, because per-virtual-thread native pools would not scale.
- Pooled memory is zeroed out upon _closing_ an Arena to minimize data visibility between reuses. This means the data is visible only within a try-with-resources (TWR) block, and never outside it.
- By default, a confined arena has access to 64 bytes of pooled data. The pool size is configurable via a system property and can be 8, 16, 32, or 64 bytes. Pooling can also be turned off completely by setting the pool power-of-two size to zero.
- Nested confined arenas are not supported.

## Static Analysis

An extensive static corpus analysis of third-party libraries and the JDK itself has been conducted with respect to `Arena.ofConfined()` usage, revealing that confined arenas were used _only_ in TWR blocks and _never_ in an unstructured way. The static analysis further revealed that in most cases, only a small amount of native memory was ever allocated, usually less than 32 bytes, and in many cases, 8 bytes or less. This usage pattern lends itself well to pooling.

## Dynamic Analysis

A dynamic statistical analysis of actual runs was also made, where various properties of confined arenas were recorded and summarized during a complete tier1 test run. While a tier1 run is not necessarily representative of a typical application workload, it provided some interesting results:

The run produced 93 per-process histogram blocks and 788,773,092 closed confined arenas. The result is dominated by arenas with no native allocation at all: 375,934,768 arenas (47.661%) are in the zero-byte bucket. Counting arenas up to 63 bytes covers 99.997% of all arena closures. The largest count bucket is 8-15 bytes per arena with 400,951,293 arenas (50.832% of all arenas). The largest byte bucket is 8-15 bytes per arena with 3,207,623,039 B (3,059.03 MiB) (46.794% of all bytes).
Buckets below 64 KiB preserve very close to 100% of arena count and 53.583% of bytes.

Metric | Value
-- | --
Histogram blocks / JVMs | 93 / 93
Total confined arenas closed | 788,773,092
Total allocated bytes recorded | 6,854,736,704 B (6,537.19 MiB)
Average bytes per arena | 8.69 B
Arenas with zero allocated bytes | 375,934,768 (47.661%)
Arenas <= 15 bytes | 776,891,372 (98.494%)
Arenas <= 31 bytes | 778,210,621 (98.661%)
Bytes from arenas <= 31 bytes | 3,231,340,337 B (3,081.65 MiB) (47.140%)
Arenas <= 63 bytes | 788,747,109 (99.997%)
Bytes from arenas <= 63 bytes | 3,664,546,412 B (3,494.78 MiB) (53.460%)
Arenas < 64 KiB | 788,772,608 (100.000%)
Bytes from arenas < 64 KiB | 3,672,973,605 B (3,502.82 MiB) (53.583%)
Arenas >= 64 KiB | 484 (0.000061%)
Bytes from arenas >= 64 KiB | 3,181,763,099 B (3,034.37 MiB) (46.417%)
Arenas >= 512 KiB | 473 (0.000060%)
Bytes from arenas >= 512 KiB | 3,178,724,123 B (3,031.47 MiB) (46.373%)

<img width="1200" height="650" alt="image" src="https://github.com/user-attachments/assets/fbcb5370-6835-4807-8cf7-cc539805c123" />
<img width="1200" height="650" alt="image" src="https://github.com/user-attachments/assets/5869a282-44a5-498c-a8cd-3f0cc26328ed" />
<img width="1200" height="650" alt="image" src="https://github.com/user-attachments/assets/d833f510-3c7d-488e-b9c2-989ad89e5177" />

Capturing total allocation per arena shows a surprisingly large population of arenas that close without native allocation. This changes the count-weighted view substantially: zero-byte arenas alone represent 47.661% of closures, while arenas at or below 63 bytes represent 99.997%. The byte-weighted view is led by the 8-15 byte bucket, with additional weight from the 32-63 byte bucket and the large-allocation tail. The 8-15 byte bucket contributes 46.794% of all bytes, the 32-63 byte bucket contributes 6.320%, and arenas at or above 64 KiB are only 0.000061% of closures but contribute 46.417% of bytes.
This suggests separating zero/tiny arenas from larger total-allocation arenas when evaluating confined-arena cache policy.

Looking at a more granular histogram where we can observe all the occurrences of 0-63 byte buckets individually (and where the 63-byte bucket accounts for 63 bytes and above), we can see that:

- The workload is dominated by empty arenas and 8-byte arenas: together they account for about 98.34% of all arenas.
- The 8-byte bucket alone accounts for 52.92% of arenas and 47.00% of total allocated bytes.
- The 63+ bucket is tiny by arena count, but contributes 46.07% of total bytes, so large arenas are rare but byte-heavy.
- This distribution strongly supports optimizing small confined arenas, especially the 8-byte case.

<img width="1100" height="620" alt="image" src="https://github.com/user-attachments/assets/ec0429b3-e4e8-48b5-a128-bf51418bc283" />

The data above supports choosing a 64-byte pool, which would allow most allocations to be served from the pool.

## Performance

Pooling improves tiny confined allocations consistently across all platforms. For 5-byte allocations, speedups range from 6.8x on Linux x64 to 18.6x on macOS aarch64. For 20-byte allocations, speedups range from 7.8x to 18.9x.

For 100 bytes and larger, results are broadly neutral (as the default pool size is 64 bytes), with small wins/losses around parity. That is consistent with the feature targeting small allocations that fit in the cached pool slot; larger allocations mostly follow the previous, regular path.
Platform | 5-byte speedup | 20-byte speedup | 100+ byte behavior
-- | -- | -- | --
Linux aarch64 | 12.3x | 8.6x | Near parity
Linux x64 | 6.8x | 7.8x | Near parity
macOS aarch64 | 18.6x | 16.8x | Near parity
Windows x64 | 16.8x | 18.9x | Near parity

<img width="1890" height="1007" alt="perf-platform-speedup" src="https://github.com/user-attachments/assets/7c428faa-18c6-411e-8768-1abbf2a6788a" />

Virtual thread performance gains are a bit smaller due to the extra CAS overhead, but virtual threads are still several times faster than on the previous, regular path.

### Fine-Grained Run

A more fine-grained run on a Mac M1 shows a large and consistent improvement from the confined allocation pool. Across allocation sizes 0..64, pooled confined allocation averages 1.373 ns/op, compared with 17.404 ns/op without the pool. That corresponds to an average speedup of 12.7x, or about 92.1% lower allocation time.

The speedup remains strong across the entire measured range:

- Minimum speedup: 9.6x at size 64
- Maximum speedup: 13.9x at size 27
- Typical pooled allocation time: roughly 1.3-1.4 ns/op up to size 40

<img width="1980" height="1350" alt="mac-m1-confined-pool-benchmark" src="https://github.com/user-attachments/assets/a10a28e9-0b2e-455f-b9fb-6341cb461146" />

### General Observations

A performance test with three consecutive allocations (rather than just one) shows an even more pronounced gain.

It is also believed that pooling would reduce the load on the operating system itself, as there are fewer malloc/free calls.

## Security and Memory Safety

* Pool field is package-private, not protected.
* Pool contents are zeroed before reuse.
* Closed pooled segments remain closed and cannot access reused memory.
* Shared virtual-thread slots are released only after zeroing.
* Defensive release paths reject release without acquire and illegal zeroing sizes.
* Thread termination cleanup clears/frees/releases acquired pool state.
* CAS operations ensure memory ordering and visibility for zeroing pools that are sharable across virtual threads.

## Workloads

Workloads most likely to benefit are those that use the FFM API with many short-lived `Arena.ofConfined()` allocations, especially tiny native segments.

### Most likely winners:

- Native library bindings using FFM in hot paths, especially jextract-style wrappers.
- Code that does `try (Arena arena = Arena.ofConfined()) { ... }` around each native call.
- Allocations of small C values or structs: `int*`, `long*`, handles, pointers, small out-parameters, small option structs.
- Small `allocateFrom(...)` calls, for example, short byte arrays or short strings passed to native code.
- Latency-sensitive loops where the fixed cost of creating/closing confined arenas dominates the actual native allocation size.
- Workloads with per-thread, non-overlapping confined arenas, or only a small number of concurrently live confined arenas.

### Less likely to benefit:

- Large native allocations, since they fall back to the regular path.
- Long-lived arenas with many allocations, where arena creation/close overhead is already amortized.
- `Arena.ofShared()`, `Arena.ofAuto()`, or `Arena.global()` users.
- Code already using custom slicing/pooling allocators.
- Workloads with many concurrently live confined arenas per thread beyond the pool slot count.

In short, this PR helps FFM code that treats confined arenas as cheap, scoped scratch space for small native call arguments. The benchmark shape supports that: large gains for 0-64 byte allocations, near-neutral behavior for larger sizes.

## Memory Allocation Concerns

Care has been taken to mitigate memory allocation by lazily allocating the pool on a per-need basis. Threads that do not use confined arenas will never allocate memory pools. Only a single `long` field overhead is imposed on `Thread`. Confined arenas that allocate larger segments will not acquire a pool until an allocation can fit in the pool.
When a thread stops, the pool is deterministically freed and returned to the system.

## Discussion Points

- Should pooling be enabled by default or only enabled by setting a system property?
- Should the default pool size be 64 bytes or something else?

## CSR Statement

No CSR is needed: this change does not modify any public API, exported module boundary, or documented system property; it only changes internal allocation strategy and tests/benchmarks.

## Future Work

- If this PR is integrated, we can consider removing the internal `CaptureStateUtil` class.
- Other arena types may also be pooled.

## Testing

This PR passes the following tiers on multiple platforms:

- [X] tier1
- [X] tier2
- [X] tier3
- [X] tier4

## Reviewer's Quick Guidelines

Pool state invariants:

- Platform Thread.confinedMemoryPool:
  - 0 = no pool allocated
  - positive = allocated and available
  - negative = allocated and currently acquired
- VirtualThread.confinedMemoryPool:
  - 0 = no shared slot held
  - positive = address of currently acquired shared slot
- Pooled memory is zeroed before being made available for reuse (minimized data exposure).
- If pool acquisition fails, allocation falls back to regular native allocation.
- Disabled pool size is represented as -1 and must not allocate/acquire pool memory.

---------
- [x] I confirm that I make this contribution in accordance with the [OpenJDK Interim AI Policy](https://openjdk.org/legal/ai).

-------------

Commit messages:
 - Merge branch 'master' into rfe-cached-arena-ofconfined-clean
 - Protect agains OOME
 - Mitigate inlining issues
 - Add thread cleanup test
 - Add tests
 - Improve defensive robustness
 - Fix thread factory issue
 - Improve testing
 - Improve comments
 - Fix test issues and visibility
 - ...
and 14 more: https://git.openjdk.org/jdk/compare/f4b13d9c...3c5c97b4

Changes: https://git.openjdk.org/jdk/pull/31365/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31365&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8385697
  Stats: 1337 lines in 18 files changed: 1317 ins; 6 del; 14 mod
  Patch: https://git.openjdk.org/jdk/pull/31365.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31365/head:pull/31365

PR: https://git.openjdk.org/jdk/pull/31365
