Maybe this discussion would be best moved to panama-dev?

In any case, for best performance it is better to use an indexed (or strided) var handle: your loop creates a new MemoryAddress on each iteration, which will not be a problem once MemoryAddress becomes an inline type, but in the meantime it can cost an allocation on every access...
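For reference, this is roughly what I mean - a quick sketch against the JDK 14 incubator API (jdk.incubator.foreign), untested, with the field name made up for illustration. The var handle is obtained from a layout path and takes a logical element index as an extra coordinate, so the loop never has to materialize a fresh MemoryAddress:

    // imports assumed: java.lang.invoke.VarHandle, jdk.incubator.foreign.*
    // an int-accessing var handle whose coordinates are (MemoryAddress, long index)
    static final VarHandle INT_ELEM_HANDLE =
            MemoryLayout.ofSequence(MemoryLayouts.JAVA_INT)
                        .varHandle(int.class, MemoryLayout.PathElement.sequenceElement());

    // in the loop, index by element rather than by byte offset:
    MemoryAddress base = segment.baseAddress();
    for (long i = 0; i < ARRAY_SIZE; i++) {
        INT_ELEM_HANDLE.set(base, i, 4);   // no per-iteration MemoryAddress is created
    }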

We have some benchmarks here:

http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign

Your test seems similar to this:

http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java

In the panama repo this benchmark gets the same numbers as the ByteBuffer version, and the same loop unrolling (though the panama repo has one performance optimization that JDK 14 doesn't yet have, working around C2's lack of optimization for longs used in loops). The workaround is an implementation change that lets us use ints instead of longs in bounds checks when the API can prove that the segment is small - that work is described in this thread:

https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html

And the corresponding, longer term C2 fix is captured here:

https://bugs.openjdk.java.net/browse/JDK-8223051

That said, even without that performance fix, I wouldn't expect the memory access API to be 4x slower. I'd start by dropping the acquire() [which you probably don't need, and which does a CAS], moving to an indexed var handle (by replicating the benchmark code linked above), and seeing whether that works better.
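Concretely, something along these lines (again untested, and reusing the hypothetical indexed handle from the sketch above, stored as a BenchmarkState field) is what I'd try:

    @Benchmark
    public void intHandleIndexedTest(BenchmarkState state) {
        // no acquire(): the segment is used directly, so there is no CAS per invocation
        var base = BenchmarkState.segment.baseAddress();
        for (long i = 0; i < ARRAY_SIZE; i++) {
            // indexed access: coordinates are (MemoryAddress, long index, int value)
            BenchmarkState.intElemHandle.set(base, i, 4);
        }
    }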

Maurizio

On 15/01/2020 18:00, Andrew Haley wrote:
On 1/9/20 4:37 PM, Maurizio Cimadamore wrote:
> There you go
>
> cr.openjdk.java.net/~mcimadamore/8235837_javadoc

Thank you.

So I've been kicking the tyres, and I'm rather surprised at how poor
the performance seems to be. My simple test, like this:

     @Benchmark
     public void intHandleTest(BenchmarkState state) {
         try (var segment = BenchmarkState.segment.acquire()) {
             var base = segment.baseAddress();
             final var byteSize = ARRAY_SIZE * 4;
             for (int i = 0; i < byteSize; i += 4) {
                 BenchmarkState.intHandle.set(base.offset(i), (int) 4);
             }
         }
     }

has a great deal of overhead. It was a bit of a struggle to get it to
unroll nicely, and the best I could get was

   6.90%  │  0x00007faeeff7dec8:   mov    r9d,r11d
          │  0x00007faeeff7decb:   add    r9d,0x4                      ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest@45 (line 34)
          │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)
          │  0x00007faeeff7decf:   mov    rdx,rbx
          │  0x00007faeeff7ded2:   add    rdx,0x10                     ;*i2l {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest@35 (line 35)
          │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)
   0.06%  │  0x00007faeeff7ded6:   cmp    rdx,rdi
          │  0x00007faeeff7ded9:   jg     0x00007faeeff7df94           ;*ifle {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - jdk.internal.foreign.MemorySegmentImpl::checkBounds@20 (line 196)
          │                                                            ; - jdk.internal.foreign.MemorySegmentImpl::checkRange@29 (line 178)
          │                                                            ; - jdk.internal.foreign.MemoryAddressImpl::checkAccess@21 (line 84)
          │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts::checkAddress@15 (line 50)
          │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0@7 (line 85)
          │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set@7
          │                                                            ; - java.lang.invoke.VarHandleGuards::guard_LI_V@33 (line 114)
          │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest@42 (line 35)
          │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)
          │  0x00007faeeff7dedf:   mov    DWORD PTR [rsi+0x10],0x4     ;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - jdk.internal.misc.Unsafe::putIntUnaligned@10 (line 3693)
          │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0@38 (line 86)
          │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set@7
          │                                                            ; - java.lang.invoke.VarHandleGuards::guard_LI_V@33 (line 114)
          │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest@42 (line 35)
          │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)

for every store. In contrast, similar ByteBuffer code looks like:


   0.08%   ↗  0x00007f3b5bf717c0:   movsxd r13,r8d
   0.16%   │  0x00007f3b5bf717c3:   mov    r14,rdx
           │  0x00007f3b5bf717c6:   add    r14,r13
   1.00%   │  0x00007f3b5bf717c9:   movsxd r13,r8d
   0.04%   │  0x00007f3b5bf717cc:   vmovdqu YMMWORD PTR [rdx+r13*1],ymm4
   6.87%   │  0x00007f3b5bf717d2:   vmovdqu YMMWORD PTR [r14+0x20],ymm4
   5.77%   │  0x00007f3b5bf717d8:   vmovdqu YMMWORD PTR [r14+0x40],ymm4
   3.99%   │  0x00007f3b5bf717de:   vmovdqu YMMWORD PTR [r14+0x60],ymm4
   6.09%   │  0x00007f3b5bf717e4:   vmovdqu YMMWORD PTR [r14+0x80],ymm4
   4.97%   │  0x00007f3b5bf717ed:   vmovdqu YMMWORD PTR [r14+0xa0],ymm4
   4.93%   │  0x00007f3b5bf717f6:   vmovdqu YMMWORD PTR [r14+0xc0],ymm4
   5.07%   │  0x00007f3b5bf717ff:   vmovdqu YMMWORD PTR [r14+0xe0],ymm4
   4.87%   │  0x00007f3b5bf71808:   vmovdqu YMMWORD PTR [r14+0x100],ymm4
   7.39%   │  0x00007f3b5bf71811:   vmovdqu YMMWORD PTR [r14+0x120],ymm4
   5.19%   │  0x00007f3b5bf7181a:   vmovdqu YMMWORD PTR [r14+0x140],ymm4
   6.21%   │  0x00007f3b5bf71823:   vmovdqu YMMWORD PTR [r14+0x160],ymm4
   4.93%   │  0x00007f3b5bf7182c:   vmovdqu YMMWORD PTR [r14+0x180],ymm4
   5.69%   │  0x00007f3b5bf71835:   vmovdqu YMMWORD PTR [r14+0x1a0],ymm4
  11.28%   │  0x00007f3b5bf7183e:   vmovdqu YMMWORD PTR [r14+0x1c0],ymm4
    4.83%   │  0x00007f3b5bf71847:   vmovdqu YMMWORD PTR [r14+0x1e0],ymm4     ;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
            │                                                            ; - jdk.internal.misc.Unsafe::putIntUnaligned@10 (line 3693)
            │                                                            ; - java.nio.DirectByteBuffer::putInt@18 (line 860)
            │                                                            ; - java.nio.DirectByteBuffer::putInt@12 (line 881)
            │                                                            ; - org.sample.ByteBufferTest::floss@15 (line 34)
            │                                                            ; - org.sample.ByteBufferTest::test@14 (line 42)
            │                                                            ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub@17 (line 241)
    2.85%   │  0x00007f3b5bf71850:   add    r8d,0x200                    ;*iinc {reexecute=0 rethrow=0 return_oop=0}
            │                                                            ; - org.sample.ByteBufferTest::floss@19 (line 33)
            │                                                            ; - org.sample.ByteBufferTest::test@14 (line 42)
            │                                                            ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub@17 (line 241)
            │  0x00007f3b5bf71857:   cmp    r8d,ecx
            ╰  0x00007f3b5bf7185a:   jl     0x00007f3b5bf717c0           ;*goto {reexecute=0 rethrow=0 return_oop=0}

nice, eh?

Benchmark                             Mode  Cnt     Score       Error  Units
ByteBufferTest.test                   avgt    5   620.628 ±     2.947  ns/op
MemoryHandlesTest.intHandleTest       avgt    5  2778.602 ± 10557.068  ns/op

Could it be that some C2 improvements or similar are being proposed?
