## Summary

This PR proposes to introduce a pooled confined arena as an optimization for 
`Arena.ofConfined()`, where small native allocations can be served from a 
reusable per-thread/per-slot memory pool instead of calling the regular native 
allocator for every short-lived arena. The arena remains confined to its owner 
thread and is still closed normally, but its backing storage can be reset and 
reused when the arena closes. The feature requires no API changes.

### Outline

 - Platform threads: one lazily allocated pool per `Thread`, encoded in 
`Thread.confinedMemoryPool`.
 - Virtual threads: a fixed shared native pool with CAS-protected slots, because 
per-virtual-thread native pools would not scale.

Pooled memory is zeroed out upon _closing_ an Arena to minimize data visibility 
between reuses. This means the data is visible only within a try-with-resources 
(TWR) block, and never outside it.
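
As an illustration of the shared-slot idea for virtual threads, here is a minimal sketch of a fixed pool whose slots are claimed with a CAS and zeroed before they are handed back. It is a model only, not the PR's code: the class `SlotPoolSketch`, its methods, and the use of `byte[]` in place of native memory are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Hypothetical model of a fixed pool of CAS-protected slots (not the PR's implementation). */
final class SlotPoolSketch {
    private static final int FREE = 0;
    private static final int TAKEN = 1;

    private final AtomicIntegerArray states; // one ownership flag per slot
    private final byte[][] slots;            // heap arrays stand in for native memory

    SlotPoolSketch(int slotCount, int slotSize) {
        this.states = new AtomicIntegerArray(slotCount);
        this.slots = new byte[slotCount][slotSize];
    }

    /** Returns the index of an acquired slot, or -1 so the caller can use the regular path. */
    int tryAcquire() {
        for (int i = 0; i < states.length(); i++) {
            if (states.compareAndSet(i, FREE, TAKEN)) {
                return i;
            }
        }
        return -1;
    }

    byte[] slot(int index) {
        return slots[index];
    }

    /** Zeroes the slot first and only then publishes it as free. */
    void release(int index) {
        if (states.get(index) != TAKEN) {
            throw new IllegalStateException("release without acquire");
        }
        Arrays.fill(slots[index], (byte) 0);
        // Volatile write: the zeroing above is visible to whichever thread acquires next.
        states.set(index, FREE);
    }
}
```

When every slot is taken, `tryAcquire()` returns -1, which corresponds to falling back to a regular native allocation.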

By default, a confined arena has access to 64 bytes of pooled data. The pool 
size is configurable via a system property and can be 8, 16, 32, or 64 bytes. 
Pooling can also be turned off completely by setting the pool power-of-two size 
to zero. Nested confined arenas are not supported.

## Static Analysis

An extensive static corpus analysis of third-party libraries and the JDK itself 
has been conducted with respect to `Arena.ofConfined()` usage, revealing that 
confined arenas were used _only_ in TWR blocks and _never_ in an unstructured 
way. The static analysis further revealed that in most cases, only a small 
amount of native memory was ever allocated, usually less than 32 bytes, and in 
many cases, 8 bytes or less. This usage pattern lends itself well to pooling. 

## Dynamic Analysis

A dynamic statistical analysis of actual runs was also made, where various 
properties of confined arenas were recorded and summarized during a complete 
tier1 test run. While a tier1 run is not necessarily representative of a 
typical application workload, it provided some interesting results:

The run produced 93 per-process histogram blocks and 788,773,092 closed 
confined arenas. The result is dominated by arenas with no native allocation at 
all: 375,934,768 arenas (47.661%) are in the zero-byte bucket. Counting arenas 
up to 63 bytes covers 99.997% of all arena closures.

The largest count bucket is 8-15 bytes per arena with 400,951,293 arenas 
(50.832% of all arenas). The largest byte bucket is 8-15 bytes per arena with 
3,207,623,039 B (3,059.03 MiB) (46.794% of all bytes). Buckets below 64 KiB 
account for very close to 100% of the arena count and 53.583% of bytes.

Metric | Value
-- | --
Histogram blocks / JVMs | 93 / 93
Total confined arenas closed | 788,773,092
Total allocated bytes recorded | 6,854,736,704 B (6,537.19 MiB)
Average bytes per arena | 8.69 B
Arenas with zero allocated bytes | 375,934,768 (47.661%)
Arenas <= 15 bytes | 776,891,372 (98.494%)
Arenas <= 31 bytes | 778,210,621 (98.661%)
Bytes from arenas <= 31 bytes | 3,231,340,337 B (3,081.65 MiB) (47.140%)
Arenas <= 63 bytes | 788,747,109 (99.997%)
Bytes from arenas <= 63 bytes | 3,664,546,412 B (3,494.78 MiB) (53.460%)
Arenas < 64 KiB | 788,772,608 (100.000%)
Bytes from arenas < 64 KiB | 3,672,973,605 B (3,502.82 MiB) (53.583%)
Arenas >= 64 KiB | 484 (0.000061%)
Bytes from arenas >= 64 KiB | 3,181,763,099 B (3,034.37 MiB) (46.417%)
Arenas >= 512 KiB | 473 (0.000060%)
Bytes from arenas >= 512 KiB | 3,178,724,123 B (3,031.47 MiB) (46.373%)

<img width="1200" height="650" alt="image" 
src="https://github.com/user-attachments/assets/fbcb5370-6835-4807-8cf7-cc539805c123"
 />

<img width="1200" height="650" alt="image" 
src="https://github.com/user-attachments/assets/5869a282-44a5-498c-a8cd-3f0cc26328ed"
 />

<img width="1200" height="650" alt="image" 
src="https://github.com/user-attachments/assets/d833f510-3c7d-488e-b9c2-989ad89e5177"
 />

Capturing total allocation per arena shows a surprisingly large population of 
arenas that close without native allocation. This changes the count-weighted 
view substantially: zero-byte arenas alone represent 47.661% of closures, while 
arenas at or below 63 bytes represent 99.997%.

The byte-weighted view is led by the 8-15 byte bucket, with additional weight 
from the 32-63 byte bucket and the large-allocation tail. The 8-15 byte bucket 
contributes 46.794% of all bytes, the 32-63 byte bucket contributes 6.320%, and 
arenas at or above 64 KiB are only 0.000061% of closures but contribute 46.417% 
of bytes. This suggests separating zero/tiny arenas from larger 
total-allocation arenas when evaluating confined-arena cache policy.
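
The derived figures in the table (percentages, MiB values, and the average) can be recomputed from the raw counts. The sketch below does this with the two totals reported above. The class and method names are made up for the example, and MiB is assumed to mean 1024 * 1024 bytes.

```java
/** Recomputes derived histogram figures from the totals reported in the table. */
final class ArenaStatsCheck {
    static final long TOTAL_ARENAS = 788_773_092L;  // total confined arenas closed
    static final long TOTAL_BYTES = 6_854_736_704L; // total allocated bytes recorded

    /** Share of all arena closures, in percent. */
    static double percentOfArenas(long arenaCount) {
        return 100.0 * arenaCount / TOTAL_ARENAS;
    }

    /** Share of all allocated bytes, in percent. */
    static double percentOfBytes(long bytes) {
        return 100.0 * bytes / TOTAL_BYTES;
    }

    /** Mean number of bytes allocated per arena. */
    static double averageBytesPerArena() {
        return (double) TOTAL_BYTES / TOTAL_ARENAS;
    }

    /** Converts bytes to MiB, assuming 1 MiB = 1024 * 1024 bytes. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }
}
```

For example, `percentOfArenas(375_934_768L)` gives the zero-byte share, and `percentOfBytes(3_181_763_099L)` gives the byte share of arenas at or above 64 KiB.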

Looking at a more granular histogram where we can observe all the occurrences 
of 0-63 byte buckets individually (and where the 63-byte bucket accounts for 63 
bytes and above), we can see that:

 - The workload is dominated by empty arenas and 8-byte arenas: together they 
account for about 98.34% of all arenas.
 - The 8-byte bucket alone accounts for 52.92% of arenas and 47.00% of total 
allocated bytes.
 - The 63+ bucket is tiny by arena count, but contributes 46.07% of total 
bytes, so large arenas are rare but byte-heavy.
 - This distribution strongly supports optimizing small confined arenas, 
especially the 8-byte case.

<img width="1100" height="620" alt="image" 
src="https://github.com/user-attachments/assets/ec0429b3-e4e8-48b5-a128-bf51418bc283"
 />


The data above suggests that a 64-byte pool would allow most confined arenas to 
be served entirely from the pool.

## Performance

Pooling improves tiny confined allocations consistently across all platforms. 
For 5-byte allocations, speedups range from 6.8x on Linux x64 to 18.6x on macOS 
aarch64. For 20-byte allocations, speedups range from 7.8x to 18.9x.

For 100 bytes and larger, results are broadly neutral (as the default pool size 
is 64 bytes), with small wins/losses around parity. That is consistent with the 
feature targeting small allocations that fit in the cached pool slot; larger 
allocations mostly follow the previous, regular path.


Platform | 5-byte speedup | 20-byte speedup | 100+ byte behavior
-- | -- | -- | --
Linux aarch64 | 12.3x | 8.6x | Near parity
Linux x64 | 6.8x | 7.8x | Near parity
macOS aarch64 | 18.6x | 16.8x | Near parity
Windows x64 | 16.8x | 18.9x | Near parity

<img width="1890" height="1007" alt="perf-platform-speedup" 
src="https://github.com/user-attachments/assets/7c428faa-18c6-411e-8768-1abbf2a6788a"
 />

Virtual thread performance gains are a bit smaller due to the extra CAS 
overhead, but are still several times faster than the previous, regular path.

### Fine-Grained Run

A more fine-grained run on a Mac M1 shows a large and consistent improvement 
from the confined allocation pool. Across allocation sizes 0..64, pooled 
confined allocation averages 1.373 ns/op, compared with 17.404 ns/op without 
the pool.

That corresponds to an average speedup of 12.7x, or about 92.1% lower 
allocation time. The speedup remains strong across the entire measured range:

 - Minimum speedup: 9.6x at size 64
 - Maximum speedup: 13.9x at size 27
 - Typical pooled allocation time: roughly 1.3-1.4 ns/op up to size 40
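
The headline figures can be reproduced from the two averages, assuming the speedup is taken as the ratio of the average times (the reported 12.7x may instead be a mean of per-size ratios, which would land very close to the same value). The helper names below are made up for the example.

```java
/** Recomputes speedup and time reduction from two average timings in ns/op. */
final class SpeedupCheck {
    /** How many times faster the pooled path is than the baseline. */
    static double speedup(double baselineNs, double pooledNs) {
        return baselineNs / pooledNs;
    }

    /** Reduction in allocation time relative to the baseline, in percent. */
    static double percentReduction(double baselineNs, double pooledNs) {
        return 100.0 * (1.0 - pooledNs / baselineNs);
    }
}
```

With the measured averages, `speedup(17.404, 1.373)` rounds to 12.7 and `percentReduction(17.404, 1.373)` rounds to 92.1.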


<img width="1980" height="1350" alt="mac-m1-confined-pool-benchmark" 
src="https://github.com/user-attachments/assets/a10a28e9-0b2e-455f-b9fb-6341cb461146";
 />

### General Observations

A performance test with three consecutive allocations (rather than just one) 
shows an even more pronounced gain. 

Pooling is also expected to reduce the load on the operating system itself, 
since it makes fewer malloc/free calls, although this has not been measured.

## Security and Memory Safety

 * Pool field is package-private, not protected.
 * Pool contents are zeroed before reuse.
 * Closed pooled segments remain closed and cannot access reused memory.
 * Shared virtual-thread slots are released only after zeroing.
 * Defensive release paths reject release without acquire and illegal zeroing 
sizes.
 * Thread termination cleanup clears/frees/releases acquired pool state.
 * CAS operations ensure memory ordering and visibility for zeroing pools that 
are sharable across virtual threads.

## Workloads

Workloads most likely to benefit are those that use the FFM API with many 
short-lived `Arena.ofConfined()` allocations, especially tiny native segments.

### Most likely winners:

 - Native library bindings using FFM in hot paths, especially jextract-style 
wrappers.
 - Code that does `try (Arena arena = Arena.ofConfined()) { ... }` around each 
native call.
 - Allocations of small C values or structs: `int*`, `long*`, handles, pointers, 
small out-parameters, small option structs.
 - Small `allocateFrom(...)` calls, for example, short byte arrays or short 
strings passed to native code.
 - Latency-sensitive loops where the fixed cost of creating/closing confined 
arenas dominates the actual native allocation size.
 - Workloads with per-thread, non-overlapping confined arenas, or only a small 
number of concurrently live confined arenas.
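
The usage pattern described in the list above looks roughly like the following sketch, which requires the final FFM API (JDK 22 or later). `fakeNativeCall` is a made-up stand-in for a real downcall that writes through an `int*` out-parameter. Whether a given allocation is served from the pool is not visible through the API, so the code is the same with or without pooling.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ScratchArenaExample {
    /** Hypothetical stand-in for a native function that fills an int* out-parameter. */
    static void fakeNativeCall(MemorySegment out) {
        out.set(ValueLayout.JAVA_INT, 0, 42);
    }

    /** One short-lived confined arena per call, holding a single 4-byte out-parameter. */
    static int callWithOutParameter() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment out = arena.allocate(ValueLayout.JAVA_INT);
            fakeNativeCall(out);
            return out.get(ValueLayout.JAVA_INT, 0);
        } // the arena closes here; the segment is no longer accessible
    }

    /** Copies a short string into native memory as a NUL-terminated UTF-8 C string. */
    static long nativeStringSize(String s) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(s);
            return cString.byteSize();
        }
    }
}
```

Both methods allocate well under 64 bytes per arena, which is the case the default pool size is aimed at.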

### Less likely to benefit:

 - Large native allocations, since they fall back to the regular path.
 - Long-lived arenas with many allocations, where arena creation/close overhead 
is already amortized.
 - `Arena.ofShared()`, `Arena.ofAuto()`, or `Arena.global()` users.
 - Code already using custom slicing/pooling allocators.
 - Workloads with many concurrently live confined arenas per thread beyond the 
pool slot count.

In short, this PR helps FFM code that treats confined arenas as cheap, scoped 
scratch space for small native call arguments. The benchmark shape supports 
that: large gains for 0-64 byte allocations, near-neutral behavior for larger 
sizes.

## Memory Allocation Concerns

Care has been taken to limit memory overhead by allocating the pool lazily, 
only when it is first needed. Threads that do not use confined arenas will never 
allocate memory pools. Only a single `long` field overhead is imposed on 
`Thread`.

Confined arenas that allocate larger segments will not acquire a pool until an 
allocation can fit in the pool.

When a thread stops, the pool is deterministically freed and returned to the 
system.

## Discussion Points

 - Should pooling be enabled by default or only enabled by setting a system 
property?
 - Should the default pool size be 64 bytes or something else?

## CSR Statement

No CSR is needed: this change does not modify any public API, exported module 
boundary, or documented system property; it only changes internal allocation 
strategy and tests/benchmarks.

## Future Work

 - If this PR is integrated, we can consider removing the internal 
`CaptureStateUtil` class.
 - Other arena types may also be pooled.

## Testing

This PR passes the following tiers on multiple platforms:
- [X] tier1
- [X] tier2
- [X] tier3
- [X] tier4

## Reviewer's Quick Guidelines

Pool state invariants:
- Platform `Thread.confinedMemoryPool`:
  - 0 = no pool allocated
  - positive = allocated and available
  - negative = allocated and currently acquired

- `VirtualThread.confinedMemoryPool`:
  - 0 = no shared slot held
  - positive = address of currently acquired shared slot

- Pooled memory is zeroed before being made available for reuse (minimizes data 
exposure).
- If pool acquisition fails, allocation falls back to regular native allocation.
- Disabled pool size is represented as -1 and must not allocate/acquire pool 
memory.
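
To make the platform-thread encoding concrete, here is a small model of a single `long` state field with lazy allocation, acquire, and release. It is a sketch under assumptions, not the PR's code: the class name, the fake address, and the choice to store the negated address while acquired are all hypothetical. The source only states that a negative value means "allocated and currently acquired".

```java
/** Hypothetical model of the platform-thread pool state held in one long field. */
final class PoolStateSketch {
    // 0 = no pool allocated, positive = available, negative = acquired.
    // Assumption: the acquired state is stored as the negated address.
    private long state;

    /** Stand-in for a native allocation; returns a fake non-zero address. */
    private static long fakeAllocate() {
        return 0x1000L;
    }

    /** Returns the pool address, or 0 if the pool is already acquired (caller falls back). */
    long tryAcquire() {
        if (state == 0) {
            state = fakeAllocate(); // lazy allocation on first use
        }
        if (state < 0) {
            return 0; // for example, a nested confined arena on the same thread
        }
        long address = state;
        state = -address;
        return address;
    }

    /** Marks the pool available again; a real implementation would zero the memory first. */
    void release() {
        if (state >= 0) {
            throw new IllegalStateException("release without acquire");
        }
        state = -state;
    }

    boolean isAllocated() {
        return state != 0;
    }

    boolean isAcquired() {
        return state < 0;
    }
}
```

The model uses no synchronization because it assumes the field is only touched by its owner thread. The shared virtual-thread slots are a different case and rely on CAS.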

---------
- [x] I confirm that I make this contribution in accordance with the [OpenJDK 
Interim AI Policy](https://openjdk.org/legal/ai).

-------------

Commit messages:
 - Merge branch 'master' into rfe-cached-arena-ofconfined-clean
 - Protect agains OOME
 - Mitigate inlining issues
 - Add thread cleanup test
 - Add tests
 - Improve defensive robustness
 - Fix thread factory issue
 - Improve testing
 - Improve comments
 - Fix test issues and visibility
 - ... and 14 more: https://git.openjdk.org/jdk/compare/f4b13d9c...3c5c97b4

Changes: https://git.openjdk.org/jdk/pull/31365/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31365&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8385697
  Stats: 1337 lines in 18 files changed: 1317 ins; 6 del; 14 mod
  Patch: https://git.openjdk.org/jdk/pull/31365.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/31365/head:pull/31365

PR: https://git.openjdk.org/jdk/pull/31365
