Certain foreign function call signatures require allocation of an intermediate buffer to adapt the FFM downcall's calling convention to that of the native stub ("needsReturnBuffer"). In the current implementation, this buffer is malloc'ed and freed on every FFM invocation, which is a non-negligible overhead.
Sample stack trace:

    java.lang.Thread.State: RUNNABLE
      at jdk.internal.misc.Unsafe.allocateMemory0(java.base@25-ea/Native Method)
      at jdk.internal.misc.Unsafe.allocateMemory(java.base@25-ea/Unsafe.java:636)
      at jdk.internal.foreign.SegmentFactories.allocateMemoryWrapper(java.base@25-ea/SegmentFactories.java:215)
      at jdk.internal.foreign.SegmentFactories.allocateSegment(java.base@25-ea/SegmentFactories.java:193)
      at jdk.internal.foreign.ArenaImpl.allocateNoInit(java.base@25-ea/ArenaImpl.java:55)
      at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:60)
      at jdk.internal.foreign.ArenaImpl.allocate(java.base@25-ea/ArenaImpl.java:34)
      at java.lang.foreign.SegmentAllocator.allocate(java.base@25-ea/SegmentAllocator.java:645)
      at jdk.internal.foreign.abi.SharedUtils$2.<init>(java.base@25-ea/SharedUtils.java:388)
      at jdk.internal.foreign.abi.SharedUtils.newBoundedArena(java.base@25-ea/SharedUtils.java:386)
      at jdk.internal.foreign.abi.DowncallStub/0x000001f001084c00.invoke(java.base@25-ea/Unknown Source)
      at java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(java.base@25-ea/DirectMethodHandle$Holder)
      at java.lang.invoke.LambdaForm$MH/0x000001f00109a400.invoke(java.base@25-ea/LambdaForm$MH)
      at java.lang.invoke.Invokers$Holder.invokeExact_MT(java.base@25-ea/Invokers$Holder)

When does this happen? A fairly easy way to trigger it is to return a small aggregate by value, like this:

    struct Vector2D { double x, y; };
    Vector2D Origin() { return {0, 0}; }

On AArch64, such a struct is returned in two 128-bit registers, v0 and v1. The VM's calling convention for the native stub consequently expects a 32-byte output segment argument. The FFM downcall method handle, on the other hand, must produce a 16-byte result segment obtained from the application-provided SegmentAllocator, so it needs to perform an appropriate adaptation, roughly like so:

    MemorySegment downcallMH(SegmentAllocator a) {
        MemorySegment tmp = SharedUtils.allocate(32);
        try {
            nativeStub.invoke(tmp);                    // leaves v0, v1 in tmp
            MemorySegment result = a.allocate(16);
            result.setDouble(0, tmp.getDouble(0));     // x, from v0
            result.setDouble(8, tmp.getDouble(16));    // y, from v1
            return result;
        } finally {
            free(tmp);
        }
    }

One might argue that this cost is no worse than the allocation done through the result allocator anyway. However, the application has control over the result allocator and may, for instance, provide a segment-reusing allocator in a loop:

    MemorySegment result = allocate(resultLayout);
    SegmentAllocator allocator = (_, _) -> result;
    loop:
        mh.invoke(allocator);   // <= would like to avoid hidden allocations in here

To alleviate this, this PR remembers a single such intermediate buffer per carrier thread and reuses it in subsequent calls, very similar to what sun.nio.ch.Util.BufferCache and sun.nio.fs.NativeBuffers do, which face a similar issue.
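For a concrete picture of the segment-reusing usage pattern above, the following sketch wires up the Vector2D/Origin example with the standard FFM API and reuses one result segment via SegmentAllocator.prefixAllocator. The library name "vectors", the symbol lookup and the loop bound are made up for illustration; only the hidden 32-byte return buffer inside the downcall is what this PR addresses.

    import java.lang.foreign.*;
    import java.lang.invoke.MethodHandle;

    // Hypothetical usage sketch: Origin() returns struct Vector2D by value;
    // every invocation writes its result into the same 16-byte segment.
    class ReusedResultSegment {
        static final StructLayout VECTOR2D = MemoryLayout.structLayout(
                ValueLayout.JAVA_DOUBLE.withName("x"),
                ValueLayout.JAVA_DOUBLE.withName("y"));

        public static void main(String[] args) throws Throwable {
            System.loadLibrary("vectors");                       // hypothetical library providing Origin()
            MethodHandle origin = Linker.nativeLinker().downcallHandle(
                    SymbolLookup.loaderLookup().find("Origin").orElseThrow(),
                    FunctionDescriptor.of(VECTOR2D));            // struct returned by value

            try (Arena arena = Arena.ofConfined()) {
                MemorySegment result = arena.allocate(VECTOR2D);
                SegmentAllocator allocator = SegmentAllocator.prefixAllocator(result);
                double sum = 0;
                for (int i = 0; i < 1_000_000; i++) {
                    MemorySegment v = (MemorySegment) origin.invokeExact(allocator);
                    sum += v.get(ValueLayout.JAVA_DOUBLE, 0) + v.get(ValueLayout.JAVA_DOUBLE, 8);
                }
                System.out.println(sum);
            }
        }
    }

The commit messages mention a CallBufferCache, but its exact shape is not reproduced here, so the following is only a rough sketch of the caching idea under simplifying assumptions: it uses a plain ThreadLocal and caches the MemorySegment object itself, whereas the PR's cache is per carrier thread and, per the commit messages, stores raw segment addresses rather than objects.

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    // Simplified, hypothetical sketch of a per-thread return-buffer cache;
    // not the PR's actual CallBufferCache.
    final class ReturnBufferCacheSketch {

        // Largest intermediate buffer worth keeping around per thread.
        private static final long CACHED_SIZE = 64;

        // At most one cached buffer per thread; an empty slot is null.
        private static final ThreadLocal<MemorySegment> CACHE = new ThreadLocal<>();

        /** Called at the start of a downcall that needs a return buffer. */
        static MemorySegment acquire(long size) {
            MemorySegment cached = CACHE.get();
            if (cached != null && cached.byteSize() >= size) {
                CACHE.set(null);                       // take ownership; slot is now empty
                return cached;
            }
            // Slot empty (first call, or a nested downcall holds the buffer):
            // fall back to a fresh GC-managed allocation.
            return Arena.ofAuto().allocate(Math.max(size, CACHED_SIZE));
        }

        /** Called once the downcall is done with its return buffer. */
        static void release(MemorySegment buffer) {
            if (CACHE.get() == null && buffer.byteSize() == CACHED_SIZE) {
                CACHE.set(buffer);                     // keep it for the next downcall
            }
            // Otherwise drop it; the automatic arena lets the GC reclaim it.
        }
    }

Keeping at most one buffer per thread bounds the retained memory, and nested downcalls on the same thread simply fall back to a fresh allocation, in the same spirit as the sun.nio caches mentioned above.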
Performance (MBA M3):

    # VM version: JDK 25-ea, OpenJDK 64-Bit Server VM, 25-ea+3-283
    Benchmark                                        Mode  Cnt      Score     Error   Units
    PointsAlloc.circle_by_ptr                        avgt    5      8.964 ±   0.351   ns/op
    PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5     95.301 ±   3.665  MB/sec
    PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5      0.224 ±   0.001    B/op
    PointsAlloc.circle_by_ptr:·gc.count              avgt    5      2.000            counts
    PointsAlloc.circle_by_ptr:·gc.time               avgt    5      3.000                ms
    PointsAlloc.circle_by_value                      avgt    5     46.498 ±   2.336   ns/op
    PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5  13141.578 ± 650.425  MB/sec
    PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5    160.224 ±   0.001    B/op
    PointsAlloc.circle_by_value:·gc.count             avgt    5    116.000            counts
    PointsAlloc.circle_by_value:·gc.time             avgt    5     44.000                ms

    # VM version: JDK 25-internal, OpenJDK 64-Bit Server VM, 25-internal-adhoc.mernst.jdk
    Benchmark                                        Mode  Cnt   Score   Error   Units
    PointsAlloc.circle_by_ptr                        avgt    5   9.108 ± 0.477   ns/op
    PointsAlloc.circle_by_ptr:·gc.alloc.rate         avgt    5  93.792 ± 4.898  MB/sec
    PointsAlloc.circle_by_ptr:·gc.alloc.rate.norm    avgt    5   0.224 ± 0.001    B/op
    PointsAlloc.circle_by_ptr:·gc.count              avgt    5   2.000          counts
    PointsAlloc.circle_by_ptr:·gc.time               avgt    5   4.000              ms
    PointsAlloc.circle_by_value                      avgt    5  13.180 ± 0.611   ns/op
    PointsAlloc.circle_by_value:·gc.alloc.rate       avgt    5  64.816 ± 2.964  MB/sec
    PointsAlloc.circle_by_value:·gc.alloc.rate.norm  avgt    5   0.224 ± 0.001    B/op
    PointsAlloc.circle_by_value:·gc.count            avgt    5   2.000          counts
    PointsAlloc.circle_by_value:·gc.time             avgt    5   5.000              ms

-------------

Commit messages:
 - tiny stylistic changes
 - Storing segment addresses instead of objects in the cache appears to be slightly faster. Write barrier?
 - (c)
 - unit test
 - move CallBufferCache out
 - shave off a couple more nanos
 - Add comparison benchmark for out-parameter.
 - copyright header
 - Benchmark:
 - move pinned cache lookup out of constructor.
 - ... and 20 more: https://git.openjdk.org/jdk/compare/8460072f...4a2210df

Changes: https://git.openjdk.org/jdk/pull/23142/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23142&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8287788
  Stats: 402 lines in 7 files changed: 377 ins; 0 del; 25 mod
  Patch: https://git.openjdk.org/jdk/pull/23142.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23142/head:pull/23142

PR: https://git.openjdk.org/jdk/pull/23142