Maybe this would be best moved to panama-dev?
In any case, for best performance it is best to use an
indexed (or strided) var handle: your loop creates a new memory
address on every iteration, which will not be a problem once
MemoryAddress becomes an inline type, but in the meantime...
We have some benchmarks here:
http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign
Your test seems similar to this:
http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java
In the panama repo this benchmark obtains the same numbers as ByteBuffer,
with the same loop unrolling (but the panama repo has one performance
optimization that JDK 14 doesn't have yet, to work around the lack of
optimization for longs used in loops). That optimization is an
implementation change which allows us to use ints instead of longs in
bounds checks, when the API can prove that the segment is small; that
work is described in this thread:
https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
And the corresponding, longer term C2 fix is captured here:
https://bugs.openjdk.java.net/browse/JDK-8223051
That said, even without that performance fix, I wouldn't expect the memory
access API to be 4x slower. I'd start by dropping the acquire() [which
you probably don't need, and which does a CAS], and moving to an indexed
var handle (by replicating the benchmark code linked above) and see if that
works better.
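For illustration only, and not the JDK 14 incubator API itself: the sketch below uses the standard `MethodHandles.byteBufferViewVarHandle` as a stand-in to show the shape of an indexed access, where the loop passes the offset as a var handle coordinate instead of building a new address object on every iteration. The class name `IndexedHandleSketch`, the `fill` helper and the `ARRAY_SIZE` value are assumptions made for the example; the benchmark linked above shows the memory access API equivalent.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IndexedHandleSketch {
    // Hypothetical size, standing in for ARRAY_SIZE in the quoted benchmark.
    static final int ARRAY_SIZE = 1024;

    // One shared handle; its coordinates are (ByteBuffer, int byteIndex).
    static final VarHandle INT_HANDLE =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    // Stores 'value' into every int slot, passing the byte offset as a
    // coordinate rather than creating a per-iteration address object.
    static void fill(ByteBuffer buffer, int value) {
        final int byteSize = ARRAY_SIZE * Integer.BYTES;
        for (int i = 0; i < byteSize; i += Integer.BYTES) {
            INT_HANDLE.set(buffer, i, value);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(ARRAY_SIZE * Integer.BYTES)
                .order(ByteOrder.nativeOrder());
        fill(buffer, 4);
        // Read back the first and last elements as a sanity check.
        System.out.println(buffer.getInt(0) + " "
                + buffer.getInt((ARRAY_SIZE - 1) * Integer.BYTES));
    }
}
```

Keeping the handle in a `static final` field matters here, since the JIT can only treat it as a constant and inline the access when it is one.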
Maurizio
On 15/01/2020 18:00, Andrew Haley wrote:
On 1/9/20 4:37 PM, Maurizio Cimadamore wrote:
There you go
cr.openjdk.java.net/~mcimadamore/8235837_javadoc
Thank you.
So I've been kicking the tyres, and I'm rather surprised at how poor
the performance seems to be. My simple test, like this:
    @Benchmark
    public void intHandleTest(BenchmarkState state) {
        try (var segment = BenchmarkState.segment.acquire()) {
            var base = segment.baseAddress();
            final var byteSize = ARRAY_SIZE * 4;
            for (int i = 0; i < byteSize; i += 4) {
                BenchmarkState.intHandle.set(base.offset(i), (int) 4);
            }
        }
    }
has a great deal of overhead. It was a bit of a struggle to get it to
unroll nicely, and the best I could get was
6.90% │ 0x00007faeeff7dec8: mov r9d,r11d
│ 0x00007faeeff7decb: add r9d,0x4 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.sample.MemoryHandlesTest::intHandleTest@45 (line 34)
│ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)
│ 0x00007faeeff7decf: mov rdx,rbx
│ 0x00007faeeff7ded2: add rdx,0x10 ;*i2l {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.sample.MemoryHandlesTest::intHandleTest@35 (line 35)
│ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)
0.06% │ 0x00007faeeff7ded6: cmp rdx,rdi
│ 0x00007faeeff7ded9: jg 0x00007faeeff7df94 ;*ifle {reexecute=0 rethrow=0 return_oop=0}
│ ; - jdk.internal.foreign.MemorySegmentImpl::checkBounds@20 (line 196)
│ ; - jdk.internal.foreign.MemorySegmentImpl::checkRange@29 (line 178)
│ ; - jdk.internal.foreign.MemoryAddressImpl::checkAccess@21 (line 84)
│ ; - java.lang.invoke.VarHandleMemoryAddressAsInts::checkAddress@15 (line 50)
│ ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0@7 (line 85)
│ ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set@7
│ ; - java.lang.invoke.VarHandleGuards::guard_LI_V@33 (line 114)
│ ; - org.sample.MemoryHandlesTest::intHandleTest@42 (line 35)
│ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)
│ 0x00007faeeff7dedf: mov DWORD PTR [rsi+0x10],0x4 ;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
│ ; - jdk.internal.misc.Unsafe::putIntUnaligned@10 (line 3693)
│ ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0@38 (line 86)
│ ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set@7
│ ; - java.lang.invoke.VarHandleGuards::guard_LI_V@33 (line 114)
│ ; - org.sample.MemoryHandlesTest::intHandleTest@42 (line 35)
│ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub@17 (line 191)
for every store. In contrast, similar ByteBuffer code looks like:
0.08% ↗ 0x00007f3b5bf717c0: movsxd r13,r8d
0.16% │ 0x00007f3b5bf717c3: mov r14,rdx
│ 0x00007f3b5bf717c6: add r14,r13
1.00% │ 0x00007f3b5bf717c9: movsxd r13,r8d
0.04% │ 0x00007f3b5bf717cc: vmovdqu YMMWORD PTR [rdx+r13*1],ymm4
6.87% │ 0x00007f3b5bf717d2: vmovdqu YMMWORD PTR [r14+0x20],ymm4
5.77% │ 0x00007f3b5bf717d8: vmovdqu YMMWORD PTR [r14+0x40],ymm4
3.99% │ 0x00007f3b5bf717de: vmovdqu YMMWORD PTR [r14+0x60],ymm4
6.09% │ 0x00007f3b5bf717e4: vmovdqu YMMWORD PTR [r14+0x80],ymm4
4.97% │ 0x00007f3b5bf717ed: vmovdqu YMMWORD PTR [r14+0xa0],ymm4
4.93% │ 0x00007f3b5bf717f6: vmovdqu YMMWORD PTR [r14+0xc0],ymm4
5.07% │ 0x00007f3b5bf717ff: vmovdqu YMMWORD PTR [r14+0xe0],ymm4
4.87% │ 0x00007f3b5bf71808: vmovdqu YMMWORD PTR [r14+0x100],ymm4
7.39% │ 0x00007f3b5bf71811: vmovdqu YMMWORD PTR [r14+0x120],ymm4
5.19% │ 0x00007f3b5bf7181a: vmovdqu YMMWORD PTR [r14+0x140],ymm4
6.21% │ 0x00007f3b5bf71823: vmovdqu YMMWORD PTR [r14+0x160],ymm4
4.93% │ 0x00007f3b5bf7182c: vmovdqu YMMWORD PTR [r14+0x180],ymm4
5.69% │ 0x00007f3b5bf71835: vmovdqu YMMWORD PTR [r14+0x1a0],ymm4
11.28% │ 0x00007f3b5bf7183e: vmovdqu YMMWORD PTR [r14+0x1c0],ymm4
4.83% │ 0x00007f3b5bf71847: vmovdqu YMMWORD PTR [r14+0x1e0],ymm4 ;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
│ ; - jdk.internal.misc.Unsafe::putIntUnaligned@10 (line 3693)
│ ; - java.nio.DirectByteBuffer::putInt@18 (line 860)
│ ; - java.nio.DirectByteBuffer::putInt@12 (line 881)
│ ; - org.sample.ByteBufferTest::floss@15 (line 34)
│ ; - org.sample.ByteBufferTest::test@14 (line 42)
│ ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub@17 (line 241)
2.85% │ 0x00007f3b5bf71850: add r8d,0x200 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.sample.ByteBufferTest::floss@19 (line 33)
│ ; - org.sample.ByteBufferTest::test@14 (line 42)
│ ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub@17 (line 241)
│ 0x00007f3b5bf71857: cmp r8d,ecx
╰ 0x00007f3b5bf7185a: jl 0x00007f3b5bf717c0 ;*goto {reexecute=0 rethrow=0 return_oop=0}
nice, eh?
Benchmark                        Mode  Cnt     Score       Error  Units
ByteBufferTest.test              avgt    5   620.628 ±     2.947  ns/op
MemoryHandlesTest.intHandleTest  avgt    5  2778.602 ± 10557.068  ns/op
Could it be that some C2 improvements or similar are proposed?