On Mon, 20 Jan 2025 07:30:17 GMT, Matthias Ernst <[email protected]> wrote:
>> Certain signatures for foreign function calls (e.g. HVA return by value)
>> require allocation of an intermediate buffer to adapt the FFM's to the
>> native stub's calling convention. In the current implementation, this buffer
>> is malloced and freed on every FFM invocation, a non-negligible overhead.
>>
>> Sample stack trace:
>>
>> java.lang.Thread.State: RUNNABLE
>> at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native
>> Method)
>> ...
>> at
>> jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
>> at
>> jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown
>> Source)
>> ...
>> at
>> java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)
>>
>>
>> To alleviate this, this PR remembers and reuses up to two small intermediate
>> buffers per carrier-thread in subsequent calls.
>>
>> Performance (MBA M3):
>>
>>
>> master@764d70b7df18e288582e616c62b0d7078f1ff3aa
>> Benchmark Mode Cnt Score Error Units
>> PointsAlloc.circle_by_ptr avgt 30 9.197 ? 0.037 ns/op
>> PointsAlloc.circle_by_value avgt 30 42.195 ? 0.088 ns/op <=
>> #######
>> PointsAlloc.jni_ByteBuffer_alloc avgt 30 226.127 ? 35.378 ns/op
>> PointsAlloc.jni_long_alloc avgt 30 25.297 ? 2.457 ns/op
>> PointsAlloc.panama_alloc avgt 30 27.053 ? 1.915 ns/op
>>
>> After:
>> Benchmark Mode Cnt Score Error Units
>> PointsAlloc.circle_by_ptr avgt 30 9.156 ? 0.021 ns/op
>> PointsAlloc.circle_by_value avgt 30 11.995 ? 0.051 ns/op <=
>> #######
>> PointsAlloc.jni_ByteBuffer_alloc avgt 30 211.161 ? 23.284 ns/op
>> PointsAlloc.jni_long_alloc avgt 30 24.885 ? 2.461 ns/op
>> PointsAlloc.panama_alloc avgt 30 26.905 ? 1.935 ns/op
>>
>>
>> `-prof gc` also shows that the new call path is fully scalar-replaced vs 160
>> byte/call before.
>
> Matthias Ernst has updated the pull request incrementally with one additional
> commit since the last revision:
>
> Implementation notes.
src/java.base/share/classes/jdk/internal/foreign/abi/CallBufferCache.java line
75:
> 73:
> 74: @SuppressWarnings("restricted")
> 75: public static long allocate(long size) {
I don't think we should expose allocation/release methods (see related comment
in `SharedUtils`)
src/java.base/share/classes/jdk/internal/foreign/abi/CallBufferCache.java line
97:
> 95: public static long acquire() {
> 96: // Protect against vthread unmount.
> 97: Continuation.pin();
Other code that does this, first checks if we are in a virtual thread.
Note that we have two options for dealing with problematic cases:
* pin/unpin as soon as we see a virtual thread
* if we have a virtual thread and we detect a race in accessing the cache (e.g.
because of how the virtual threads have been scheduled to the same underlying
carrier thread), we could avoid pinning, but just allocate a fresh new segment
(thus avoiding recycling issues)
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922533447
PR Review Comment: https://git.openjdk.org/jdk/pull/23142#discussion_r1922536958