On Tue, 12 Nov 2024 10:03:50 GMT, Emanuel Peter <[email protected]> wrote:

>>> Thanks @minborg for this :) Please remember to add the misprediction count 
>>> if you can and avoid the bulk methods by having a `nextMemorySegment()` 
>>> benchmark method which make a single fill call site to observe the 
>>> different segments (types).
>>> 
>>> Having separate call-sites which observe always the same type(s) "could" be 
>>> too lucky (and gentle) for the runtime (and CHA) and would favour to have a 
>>> single address entry (or few ones, if we include any optimization for the 
>>> fill size) in the Branch Target Buffer of the cpu.
>> 
>> I've added a "mixed" benchmark. I am not sure I understood all of your 
>> comments but given my changes, maybe you could elaborate a bit more?
>
> @minborg sent me some logs from his machine, and I'm analyzing them now.
> 
> Basically, I'm trying to see why your Java code is a bit faster than the Loop 
> code.
> 
> ----------------
> 
>   44.77%                c2, level 4  
> org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop,
>  version 4, compile id 946
>   24.43%                c2, level 4  
> org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop,
>  version 4, compile id 946
>   21.80%                c2, level 4  
> org.openjdk.bench.java.lang.foreign.SegmentBulkRandomFill::heapSegmentFillLoop,
>  version 4, compile id 946
> 
> There seem to be 3 hot regions.
> 
> **main-loop** (region has 44.77%):
> 
>              ;; B33: #  out( B33 B34 ) <- in( B32 B33 ) Loop( B33-B33 
> inner main of N116 strip mined) Freq: 4.62951e+10                             
>                              
>    0.50%  ?   0x00000001149a23c0:   sxtw        x20, w4                       
>                                                                               
>                         
>           ?   0x00000001149a23c4:   add x22, x16, x20                         
>                                                                               
>                         
>    0.02%  ?   0x00000001149a23c8:   str q16, [x22]                            
>                                                                               
>                         
>   16.33%  ?   0x00000001149a23cc:   str q16, [x22, #16]             
> ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}                  
>                                   
>           ?                                                             ; - 
> jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)           
>                           
>           ?                                                             ; - 
> jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)                    
>                           
>           ?                                                             ; - 
> java.lang.invoke.VarHandleSegmentAsBytes::set@38 (line 114)                   
>                           
>           ?                                                             ; - 
> java.lang.invoke.LambdaForm$DMH/0x000000013a4d5800::invokeStatic@20           
>                           
>           ?                                                             ; - 
> java.lang.invoke.LambdaForm$MH/0x000000013a4d8070::invoke@37                  
>                           
>           ?   ...

@eme64 I'm not an expert on ARM, but profile skid is rather common on modern, 
deeply pipelined out-of-order CPUs.

> with a strange extra add that has some strange looking percentage (profile 
> inaccuracy?):

You should check a few instructions below it to find the real culprit.

More info on this topic:
- https://travisdowns.github.io/blog/2019/08/20/interrupts.html for x86
- https://easyperf.net/blog/2018/06/08/Advanced-profiling-topics-PEBS-and-LBR#processor-event-based-sampling-pebs
- https://ieeexplore.ieee.org/document/10068807 - Intel and AMD PEBS/IBS paper

If you use Intel/AMD with PEBS/IBS (if supported by your CPU), you can run 
perfasm with precise events via `perfasm:events=cycles:P` IIRC (or by adding 
more Ps? @shipilev likely knows), which should show far less skid and will 
simplify this analysis.
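
To make the invocation concrete, here is a minimal sketch of how the JMH 
arguments could be assembled. `PerfasmArgs` and its two methods are 
hypothetical helper names, the benchmark name is taken from the log quoted 
above, and the `events=cycles:P` value is the one I recalled from memory, so 
treat it as unverified. As far as I know from the perf-list man page, an 
uppercase `P` requests the maximum detected precise level, while a lowercase 
`p` can be repeated for explicit levels.

```java
import java.util.List;

public class PerfasmArgs {
    // Builds the value passed to JMH's -prof option for the perfasm profiler.
    // The event string (e.g. "cycles:P") is an assumption and is passed through unchanged.
    public static String profilerSpec(String event) {
        return "perfasm:events=" + event;
    }

    // Assembles JMH command-line arguments: a benchmark regex followed by the profiler option.
    public static List<String> jmhArgs(String benchmarkRegex, String event) {
        return List.of(benchmarkRegex, "-prof", profilerSpec(event));
    }

    public static void main(String[] args) {
        List<String> jmhArgs = jmhArgs("SegmentBulkRandomFill.heapSegmentFillLoop", "cycles:P");
        // Prints the argument line as it would appear after the JMH launcher command.
        System.out.println(String.join(" ", jmhArgs));
    }
}
```

The resulting list is what you would append to your usual JMH launcher (an 
uber-jar or `org.openjdk.jmh.Main`); only the event string needs to change 
when trying other precision modifiers.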

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22010#issuecomment-2470134089
