Certain foreign function call signatures require allocation of an intermediate buffer to adapt the FFM downcall's calling convention to that of the native stub ("needsReturnBuffer"). In the current implementation, this buffer is malloc'ed and freed on every FFM invocation, which is a non-negligible overhead.
Sample stack trace:

    java.lang.Thread.State: RUNNABLE
      at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
      at jdk.internal.misc.Unsafe.allocateMemory(java.base@25-ea/Unsafe.java:636)
      at jdk.internal.foreign.SegmentFactories.allocateMemoryWrapper(java.base@25-ea/SegmentFactories.java:215)
      at jdk.internal.foreign.SegmentFactories.allocateSegment(java.base@25-ea/SegmentFactories.java:193)
      at jdk.internal.foreign.ArenaImpl.allocateNoInit(java.base@25-ea/ArenaImpl.java:55)
      at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:60)
      at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:34)
      at java.lang.foreign.SegmentAllocator.allocate(java.base@25-ea/SegmentAllocator.java:645)
      at jdk.internal.foreign.abi.SharedUtils$2.<init>(java.base@25-ea/SharedUtils.java:388)
      at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
      at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
      at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
      at java.lang.invoke.LambdaForm$MH/0x000001f00109a400.invoke(java.base@25-ea/LambdaForm$MH)
      at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)

When does this happen? A fairly easy way to trigger it is to return a small aggregate by value, like this:

    struct Vector2D { double x, y; };
    Vector2D Origin() { return {0, 0}; }

On AArch64, such a struct is returned in two 128-bit registers, v0 and v1. The VM's calling convention for the native stub consequently expects a 32-byte output segment argument. The FFM downcall method handle, on the other hand, must produce a 16-byte result segment obtained from the application-provided SegmentAllocator, so it needs to perform an appropriate adaptation, roughly like so:

    MemorySegment downcallMH(SegmentAllocator a) {
        MemorySegment tmp = SharedUtils.allocate(32);
        try {
            nativeStub.invoke(tmp);                    // leaves v0, v1 in tmp
            MemorySegment result = a.allocate(16);
            result.setDouble(0, tmp.getDouble(0));     // x, from v0
            result.setDouble(8, tmp.getDouble(16));    // y, from v1
            return result;
        } finally {
            free(tmp);
        }
    }

One might argue that this cost is no worse than the allocation done through the result allocator anyway. However, the application has control over the result allocator and may, for instance, provide a segment-reusing allocator in a loop:

    MemorySegment result = allocate(resultLayout);
    SegmentAllocator allocator = (_, _) -> result;
    loop:
        mh.invoke(allocator);   // <= would like to avoid hidden allocations in here

To alleviate this, this PR remembers a single such intermediate buffer per carrier thread and reuses it in subsequent calls, very similar to what sun.nio.ch.Util.BufferCache and sun.nio.fs.NativeBuffers do, which face a similar issue.
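For a concrete picture of the segment-reusing usage pattern above, the following sketch wires up the Vector2D/Origin example with the standard FFM API and reuses one result segment via SegmentAllocator.prefixAllocator. The library name "vectors", the symbol lookup and the loop bound are made up for illustration; only the hidden 32-byte return buffer inside the downcall is what this PR addresses.

    import java.lang.foreign.*;
    import java.lang.invoke.MethodHandle;

    // Hypothetical usage sketch: Origin() returns struct Vector2D by value;
    // every invocation writes its result into the same 16-byte segment.
    class ReusedResultSegment {
        static final StructLayout VECTOR2D = MemoryLayout.structLayout(
                ValueLayout.JAVA_DOUBLE.withName("x"),
                ValueLayout.JAVA_DOUBLE.withName("y"));

        public static void main(String[] args) throws Throwable {
            System.loadLibrary("vectors");                       // hypothetical library providing Origin()
            MethodHandle origin = Linker.nativeLinker().downcallHandle(
                    SymbolLookup.loaderLookup().find("Origin").orElseThrow(),
                    FunctionDescriptor.of(VECTOR2D));            // struct returned by value

            try (Arena arena = Arena.ofConfined()) {
                MemorySegment result = arena.allocate(VECTOR2D);
                SegmentAllocator allocator = SegmentAllocator.prefixAllocator(result);
                double sum = 0;
                for (int i = 0; i < 1_000_000; i++) {
                    MemorySegment v = (MemorySegment) origin.invokeExact(allocator);
                    sum += v.get(ValueLayout.JAVA_DOUBLE, 0) + v.get(ValueLayout.JAVA_DOUBLE, 8);
                }
                System.out.println(sum);
            }
        }
    }

The commit messages mention a CallBufferCache, but its exact shape is not reproduced here, so the following is only a rough sketch of the caching idea under simplifying assumptions: it uses a plain ThreadLocal and caches the MemorySegment object itself, whereas the PR's cache is per carrier thread and, per the commit messages, stores raw segment addresses rather than objects.

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    // Simplified, hypothetical sketch of a per-thread return-buffer cache;
    // not the PR's actual CallBufferCache.
    final class ReturnBufferCacheSketch {

        // Largest intermediate buffer worth keeping around per thread.
        private static final long CACHED_SIZE = 64;

        // At most one cached buffer per thread; an empty slot is null.
        private static final ThreadLocal<MemorySegment> CACHE = new ThreadLocal<>();

        /** Called at the start of a downcall that needs a return buffer. */
        static MemorySegment acquire(long size) {
            MemorySegment cached = CACHE.get();
            if (cached != null && cached.byteSize() >= size) {
                CACHE.set(null);                       // take ownership; slot is now empty
                return cached;
            }
            // Slot empty (first call, or a nested downcall holds the buffer):
            // fall back to a fresh GC-managed allocation.
            return Arena.ofAuto().allocate(Math.max(size, CACHED_SIZE));
        }

        /** Called once the downcall is done with its return buffer. */
        static void release(MemorySegment buffer) {
            if (CACHE.get() == null && buffer.byteSize() == CACHED_SIZE) {
                CACHE.set(buffer);                     // keep it for the next downcall
            }
            // Otherwise drop it; the automatic arena lets the GC reclaim it.
        }
    }

Keeping at most one buffer per thread bounds the retained memory, and nested downcalls on the same thread simply fall back to a fresh allocation, in the same spirit as the sun.nio caches mentioned above.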
Performance (MBA M3):

    # VM version: JDK 25-ea, OpenJDK 64-Bit Server VM, 25-ea+3-283
    Benchmark                                        Mode  Cnt      Score     Error   Units
    PointsAlloc.circle_by_ptr                        avgt    5      8.964 ±   0.351   ns/op
    PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5     95.301 ±   3.665  MB/sec
    PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5      0.224 ±   0.001    B/op
    PointsAlloc.circle_by_ptr:·gc.count              avgt    5      2.000            counts
    PointsAlloc.circle_by_ptr:·gc.time               avgt    5      3.000                ms
    PointsAlloc.circle_by_value                      avgt    5     46.498 ±   2.336   ns/op
    PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5  13141.578 ± 650.425  MB/sec
    PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5    160.224 ±   0.001    B/op
    PointsAlloc.circle_by_value:·gc.count             avgt    5    116.000            counts
    PointsAlloc.circle_by_value:·gc.time             avgt    5     44.000                ms

    # VM version: JDK 25-internal, OpenJDK 64-Bit Server VM, 25-internal-adhoc.mernst.jdk
    Benchmark                                        Mode  Cnt   Score   Error   Units
    PointsAlloc.circle_by_ptr                        avgt    5   9.108 ± 0.477   ns/op
    PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5  93.792 ± 4.898  MB/sec
    PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5   0.224 ± 0.001    B/op
    PointsAlloc.circle_by_ptr:·gc.count              avgt    5   2.000          counts
    PointsAlloc.circle_by_ptr:·gc.time               avgt    5   4.000              ms
    PointsAlloc.circle_by_value                      avgt    5  13.180 ± 0.611   ns/op
    PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5  64.816 ± 2.964  MB/sec
    PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5   0.224 ± 0.001    B/op
    PointsAlloc.circle_by_value:·gc.count            avgt    5   2.000          counts
    PointsAlloc.circle_by_value:·gc.time             avgt    5   5.000              ms

-------------

Commit messages:
 - tiny stylistic changes
 - Storing segment addresses instead of objects in the cache appears to be slightly faster. Write barrier?
 - (c)
 - unit test
 - move CallBufferCache out
 - shave off a couple more nanos
 - Add comparison benchmark for out-parameter.
 - copyright header
 - Benchmark:
 - move pinned cache lookup out of constructor.
 - ... and 20 more: https://git.openjdk.org/jdk/compare/8460072f...4a2210df

Changes: https://git.openjdk.org/jdk/pull/23142/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23142&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8287788
  Stats: 402 lines in 7 files changed: 377 ins; 0 del; 25 mod
  Patch: https://git.openjdk.org/jdk/pull/23142.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23142/head:pull/23142

PR: https://git.openjdk.org/jdk/pull/23142