On Tue, 12 Nov 2024 10:17:45 GMT, Francesco Nigro <d...@openjdk.org> wrote:
>> @minborg sent me some logs from his machine, and I'm analyzing them now.
>>
>> Basically, I'm trying to see why your Java code is a bit faster than the
>> Loop code.
>>
>> ----------------
>>
>> 44.77%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
>> 24.43%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
>> 21.80%  c2, level 4  org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop, version 4, compile id 946
>>
>> There seem to be 3 hot regions.
>>
>> **main-loop** (region has 44.77%):
>>
>> ;; B33: # out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 inner main of N116 strip mined) Freq: 4.62951e+10
>>  0.50%  ?  0x00000001149a23c0:   sxtw x20, w4
>>         ?  0x00000001149a23c4:   add x22, x16, x20
>>  0.02%  ?  0x00000001149a23c8:   str q16, [x22]
>> 16.33%  ?  0x00000001149a23cc:   str q16, [x22, #16]   ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
>>         ?                                               ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
>>         ?                                               ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
>>         ?                                               ; - java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)
>>         ?                                               ; - java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20
>>         ?                                               ; - java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37
>> ...
>
> @eme64 not an expert with ARM, but profiling skid due to modern big
> pipelined OOO CPUs is rather frequent
>
>> with a strange extra add that has some strange looking percentage (profile
>> inaccuracy?):
>
> you should check some instr below it to get the real culprit
>
> More info on this topic:
> - https://travisdowns.github.io/blog/2019/08/20/interrupts.html for x86
> - https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR#processor-event-based-sampling-pebs
> - https://ieeexplore.ieee.org/document/10068807 - Intel and AMD PEBS/IBS paper
>
> If you use Intel/AMD and PEBS/IBS (if supported by your cpu) you can run
> perfasm using precise events via `perfasm:events=cycles:P` IIRC (or adding
> more Ps? @shipilev likely knows) which should have way less skidding and will
> simplify these analyses.

@franz1981 right. That is what I thought. I'm usually working on x64, and am not used to all the skidding of ARM.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470162785
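
For reference, the precise-event suggestion quoted above can be written out as a JMH argument list. This is only a sketch: the `perfasm:events=cycles:P` spelling is taken from the quoted "IIRC" and has not been checked here, the benchmark name comes from the log above, and the class `PerfasmArgs` and method `jmhArgs` are hypothetical names made up for illustration. On the "more Ps" question, perf itself documents `p`, `pp` and `ppp` as increasing precise levels and `P` as "use the maximum detected precise level"; whether JMH passes the string through unchanged is an assumption.

```java
import java.util.List;

// Hypothetical helper: builds the arguments one might pass to a JMH
// benchmarks jar to run perfasm with a specific perf event.
public class PerfasmArgs {

    // Assumption: the perfasm profiler accepts an "events" option, as in
    // the quoted suggestion "perfasm:events=cycles:P".
    static List<String> jmhArgs(String benchmark, String event) {
        return List.of(benchmark, "-prof", "perfasm:events=" + event);
    }

    public static void main(String[] args) {
        // Benchmark name taken from the perfasm log quoted above.
        List<String> jmh = jmhArgs("SegmentBulkRandomFill", "cycles:P");
        System.out.println(String.join(" ", jmh));
    }
}
```

The printed line is meant to be appended to whatever command launches the benchmarks jar. Precise sampling relies on PEBS/IBS, so on hardware without it (such as the ARM machine in this log) the request may be rejected or fall back to imprecise sampling.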