Re: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Galder Zamarreño Fri, 07 Feb 2025 04:31:49 -0800

On Fri, 17 Jan 2025 17:53:24 GMT, Galder Zamarreño <gal...@openjdk.org> wrote:


>> This patch intrinsifies `Math.max(long, long)` and `Math.min(long, long)` in 
>> order to help improve vectorization performance.
>> 
>> Currently vectorization does not kick in for loops containing either of 
>> these calls because of the following error:
>> 
>> 
>> VLoop::check_preconditions: failed: control flow in loop not allowed
>> 
>> 
>> The control flow is due to the Java implementation for these methods, e.g.
>> 
>> 
>> public static long max(long a, long b) {
>>     return (a >= b) ? a : b;
>> }
>> 
>> 
>> This patch intrinsifies the calls, replacing the CmpL + Bool nodes with 
>> MaxL/MinL nodes respectively.
>> By doing this, vectorization no longer finds the control flow and so it can 
>> carry out the vectorization.
>> E.g.
>> 
>> 
>> SuperWord::transform_loop:
>>     Loop: N518/N126  counted [int,int),+4 (1025 iters)  main has_sfpt 
>> strip_mined
>>  518  CountedLoop  === 518 246 126  [[ 513 517 518 242 521 522 422 210 ]] 
>> inner stride: 4 main of N518 strip mined !orig=[419],[247],[216],[193] 
>> !jvms: Test::test @ bci:14 (line 21)
>> 
>> 
>> Applying the same changes to `ReductionPerf` as in 
>> https://github.com/openjdk/jdk/pull/13056, we can compare the results before 
>> and after. Before the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1155
>> long max   1173
>> 
>> 
>> After the patch, on darwin/aarch64 (M1):
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR
>>    jtreg:test/hotspot/jtreg/compiler/loopopts/superword/ReductionPerf.java
>>                                                          1     1     0     0
>> ==============================
>> TEST SUCCESS
>> 
>> long min   1042
>> long max   1042
>> 
>> 
>> This patch does not add any platform-specific backend implementations for the 
>> MaxL/MinL nodes.
>> Therefore, it still relies on the macro expansion to transform those into 
>> CMoveL.
>> 
>> I've run tier1 and hotspot compiler tests on darwin/aarch64 and got these 
>> results:
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PA...
>
> Galder Zamarreño has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Fix typo

@eastig is helping with the results on aarch64, so I will verify the numbers in 
the same way as done below for x86_64 once he provides me with the results.

Here is a summary of the benchmarking results I'm seeing on x86_64 (I will push 
an update that just merges the latest master shortly).

First I will go through the results of `MinMaxVector`. This benchmark computes 
throughput by default so the higher the number the better.

# `MinMaxVector` AVX-512

Following are results with AVX-512 instructions:

Benchmark                       (probability)  (range)  (seed)  (size)   Mode  Cnt   Baseline     Patch   Units
MinMaxVector.longClippingRange            N/A       90       0    1000  thrpt    4    834.127  3688.961  ops/ms
MinMaxVector.longClippingRange            N/A      100       0    1000  thrpt    4   1147.010  3687.721  ops/ms
MinMaxVector.longLoopMax                   50      N/A     N/A    2048  thrpt    4   1126.718  1072.812  ops/ms
MinMaxVector.longLoopMax                   80      N/A     N/A    2048  thrpt    4   1070.921  1070.538  ops/ms
MinMaxVector.longLoopMax                  100      N/A     N/A    2048  thrpt    4    510.483  1073.081  ops/ms
MinMaxVector.longLoopMin                   50      N/A     N/A    2048  thrpt    4    935.658  1016.910  ops/ms
MinMaxVector.longLoopMin                   80      N/A     N/A    2048  thrpt    4   1007.410   933.774  ops/ms
MinMaxVector.longLoopMin                  100      N/A     N/A    2048  thrpt    4    536.582  1017.337  ops/ms
MinMaxVector.longReductionMax              50      N/A     N/A    2048  thrpt    4    967.288   966.945  ops/ms
MinMaxVector.longReductionMax              80      N/A     N/A    2048  thrpt    4    967.327   967.382  ops/ms
MinMaxVector.longReductionMax             100      N/A     N/A    2048  thrpt    4    849.689   967.327  ops/ms
MinMaxVector.longReductionMin              50      N/A     N/A    2048  thrpt    4    966.323   967.275  ops/ms
MinMaxVector.longReductionMin              80      N/A     N/A    2048  thrpt    4    967.340   967.228  ops/ms
MinMaxVector.longReductionMin             100      N/A     N/A    2048  thrpt    4    880.921   967.233  ops/ms


### `longReduction[Min|Max]` performance improves slightly when probability is 100

Without the patch the code uses compare instructions:


   7.83%  ││││ │││↗  │           0x00007f4f700fb305:   imulq            $0xb, 
0x20(%r14, %r8, 8), %rdi
          ││││ ││││  │                                                          
           ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││││ ││││  │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@24 
(line 255)
          ││││ ││││  │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
   5.64%  ││││ ││││  │           0x00007f4f700fb30b:   cmpq             %rdi, 
%rdx
          ││││╭││││  │           0x00007f4f700fb30e:   jge              
0x7f4f700fb32c      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││││││││  │                                                          
           ; - java.lang.Math::max@11 (line 2037)
          │││││││││  │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@30 
(line 256)
          │││││││││  │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
  12.82%  │││││││││↗ │           0x00007f4f700fb310:   imulq            $0xb, 
0x28(%r14, %r8, 8), %rbp
          ││││││││││ │                                                          
           ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││││││││││ │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@24 
(line 255)
          ││││││││││ │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
   7.46%  ││││││││││ │           0x00007f4f700fb316:   cmpq             %rbp, 
%rdi
          │││││╰││││ │           0x00007f4f700fb319:   jl               
0x7f4f700fb2e0      ;*iflt {reexecute=0 rethrow=0 return_oop=0}
          │││││ ││││ │                                                          
           ; - java.lang.Math::max@3 (line 2037)
          │││││ ││││ │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@30 
(line 256)
          │││││ ││││ │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)


And with the patch these become vectorized:


          │    ││ ↗││││  0x00007f56280fad10:   vpmullq          0xf0(%rdx, 
%rsi, 8), %ymm10, %ymm4
   8.35%  │    ││ │││││  0x00007f56280fad1b:   vpmullq          0xd0(%rdx, 
%rsi, 8), %ymm10, %ymm5
   4.27%  │    ││ │││││  0x00007f56280fad26:   vpmullq          0x10(%rdx, 
%rsi, 8), %ymm10, %ymm6
          │    ││ │││││                                                         
   ;   {no_reloc}
   4.22%  │    ││ │││││  0x00007f56280fad31:   vpmullq          0x30(%rdx, 
%rsi, 8), %ymm10, %ymm7
   4.00%  │    ││ │││││  0x00007f56280fad3c:   vpmullq          0xb0(%rdx, 
%rsi, 8), %ymm10, %ymm8
   4.13%  │    ││ │││││  0x00007f56280fad47:   vpmullq          0x50(%rdx, 
%rsi, 8), %ymm10, %ymm11
   4.10%  │    ││ │││││  0x00007f56280fad52:   vpmullq          0x70(%rdx, 
%rsi, 8), %ymm10, %ymm12
   4.13%  │    ││ │││││  0x00007f56280fad5d:   vpmullq          0x90(%rdx, 
%rsi, 8), %ymm10, %ymm13
   4.03%  │    ││ │││││  0x00007f56280fad68:   vpmaxsq          %ymm6, %ymm3, 
%ymm3
          │    ││ │││││  0x00007f56280fad6e:   vpmaxsq          %ymm7, %ymm3, 
%ymm3
   4.72%  │    ││ │││││  0x00007f56280fad74:   vpmaxsq          %ymm11, %ymm3, 
%ymm3
          │    ││ │││││  0x00007f56280fad7a:   vpmaxsq          %ymm12, %ymm3, 
%ymm3
   8.40%  │    ││ │││││  0x00007f56280fad80:   vpmaxsq          %ymm13, %ymm3, 
%ymm3
  23.11%  │    ││ │││││  0x00007f56280fad86:   vpmaxsq          %ymm8, %ymm3, 
%ymm3
   2.15%  │    ││ │││││  0x00007f56280fad8c:   vpmaxsq          %ymm5, %ymm3, 
%ymm3
   8.79%  │    ││ │││││  0x00007f56280fad92:   vpmaxsq          %ymm4, %ymm3, 
%ymm3 ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
          │    ││ │││││                                                         
   ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@30 (line 256)
          │    ││ │││││                                                         
   ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)


### `longLoop[Min|Max]` performance improves considerably when probability is 100

Without the patch the code uses compare + move instructions:


   4.53%  ││││  ││  │ │           0x00007f96b40faf33:   movq            
0x18(%rax, %rsi, 8), %r13;*laload {reexecute=0 rethrow=0 return_oop=0}
          ││││  ││  │ │                                                         
            ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@20 (line 
236)
          ││││  ││  │ │                                                         
            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19
 (line 124)
   2.69%  ││││  ││  │ │           0x00007f96b40faf38:   cmpq            %r11, 
%r13
          ││││╭ ││  │ │           0x00007f96b40faf3b:   jl              
0x7f96b40faf67      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││││ ││  │ │                                                         
            ; - java.lang.Math::max@11 (line 2037)
          │││││ ││  │ │                                                         
            ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@27 (line 
236)
          │││││ ││  │ │                                                         
            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19
 (line 124)
   8.75%  │││││ ││↗ │ │           0x00007f96b40faf3d:   movq            %r13, 
0x18(%rbp, %rsi, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
          │││││ │││ │ │                                                         
            ; - org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@30 (line 
236)
          │││││ │││ │ │                                                         
            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19
 (line 124)


And with the patch those become vectorized:


   3.55%  │  ││  0x00007f13c80fa18a:   vmovdqu          0xf0(%rbx, %r10, 8), 
%ymm5
          │  ││  0x00007f13c80fa194:   vmovdqu          0xf0(%rdi, %r10, 8), 
%ymm6
   2.35%  │  ││  0x00007f13c80fa19e:   vpmaxsq          %ymm6, %ymm5, %ymm5
   5.03%  │  ││  0x00007f13c80fa1a4:   vmovdqu          %ymm5, 0xf0(%rax, %r10, 
8)
          │  ││                                                            
;*lastore {reexecute=0 rethrow=0 return_oop=0}
          │  ││                                                            ; - 
org.openjdk.bench.java.lang.MinMaxVector::longLoopMax@30 (line 236)
          │  ││                                                            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub@19
 (line 124)


It's interesting to observe that at probabilities of 50/80% the baseline 
performs better than at 100%. The reason is that at 50/80% the baseline 
already vectorizes. So, why isn't the baseline vectorizing at 100% 
probability?


VLoop::check_preconditions
      Loop: N1256/N463  limit_check counted [int,int),+4 (3161 iters)  main rc  
has_sfpt strip_mined
 1256  CountedLoop  === 1256 598 463  [[ 1256 1257 1271 1272 ]] inner stride: 4 
main of N1256 strip mined !orig=[1126],[599],[590],[307] !jvms: 
MinMaxVector::longLoopMax @ bci:10 (line 236) 
MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
VLoop::check_preconditions: fails because of control flow.
  cl_exit 594  594  CountedLoopEnd  === 415 593  [[ 1275 463 ]] [lt] 
P=0.999684, C=707717.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ 
bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ 
bci:19 (line 124)
  cl_exit->in(0) 415  415  Region  === 415 411 412  [[ 415 594 416 451 ]]  
!orig=[423] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ 
bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ 
bci:19 (line 124)
  lpt->_head 1256 1256  CountedLoop  === 1256 598 463  [[ 1256 1257 1271 1272 
]] inner stride: 4 main of N1256 strip mined !orig=[1126],[599],[590],[307] 
!jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) 
MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
      Loop: N1256/N463  limit_check counted [int,int),+4 (3161 iters)  main rc  
has_sfpt strip_mined
VLoop::check_preconditions: failed: control flow in loop not allowed


At 100% probability the baseline fails to vectorize because it observes control 
flow in the loop. This control flow is not the one you see in the min/max 
implementations, but one added by HotSpot as a result of JIT profiling. The 
profile shows that one branch is always taken, so HotSpot optimizes for that 
and adds a branch for the uncommon case where it is not taken.
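
To make the 100% case concrete, here is a minimal sketch of the loop shape and of input data that makes the comparison always go the same way. This is not the `MinMaxVector` source: the loop body is inferred from the disassembly above (two loads, one max, one store per element), and `LoopMaxSketch`, `fill` and `countFirstWins` are hypothetical names. The meaning I give to `probability` here is also an assumption.

```java
import java.util.Random;

// Hypothetical sketch, not the benchmark source.
final class LoopMaxSketch {

    // Fills a and b so that a[i] > b[i] for roughly `probability` percent of
    // the elements (assumed meaning of the benchmark parameter).
    static void fill(long[] a, long[] b, int probability, long seed) {
        Random r = new Random(seed);
        for (int i = 0; i < a.length; i++) {
            long lo = r.nextInt(1000);
            long hi = lo + 1 + r.nextInt(1000); // strictly greater than lo
            boolean firstWins = r.nextInt(100) < probability;
            a[i] = firstWins ? hi : lo;
            b[i] = firstWins ? lo : hi;
        }
    }

    // Element-wise max loop, the shape seen in the disassembly.
    static void loopMax(long[] a, long[] b, long[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
    }

    // Counts how often the first operand is the larger one.
    static int countFirstWins(long[] a, long[] b) {
        int n = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] >= b[i]) {
                n++;
            }
        }
        return n;
    }
}
```

With `probability` set to 100 the comparison inside `Math.max` has the same outcome for every element, which is the kind of one-sided profile described above.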

### `longClippingRange` performance improves considerably

Without the patch the code uses compare + move instructions:


   3.39%  ││ │      ││ │            0x00007febb40fb175:   cmpq          %rbp, 
%rcx
          ││ │╭     ││ │            0x00007febb40fb178:   jge           
0x7febb40fb17d      ;*iflt {reexecute=0 rethrow=0 return_oop=0}
          ││ ││     ││ │                                                        
              ; - java.lang.Math::max@3 (line 2037)
          ││ ││     ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.MinMaxVector::longClippingRange@25 (line 220)
          ││ ││     ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19
 (line 124)
   2.69%  ││ ││     ││ │            0x00007febb40fb17a:   movq          %rbp, 
%rcx          ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          ││ ││     ││ │                                                        
              ; - java.lang.Math::max@11 (line 2037)
          ││ ││     ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.MinMaxVector::longClippingRange@25 (line 220)
          ││ ││     ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19
 (line 124)
   4.35%  ││ │↘     ││ │            0x00007febb40fb17d:   nop
   2.93%  ││ │      ││ │            0x00007febb40fb180:   cmpq          %r8, 
%rcx
          ││ │ ╭    ││ │            0x00007febb40fb183:   jle           
0x7febb40fb188      ;*ifgt {reexecute=0 rethrow=0 return_oop=0}
          ││ │ │    ││ │                                                        
              ; - java.lang.Math::min@3 (line 2132)
          ││ │ │    ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.MinMaxVector::longClippingRange@32 (line 220)
          ││ │ │    ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19
 (line 124)
   3.51%  ││ │ │    ││ │            0x00007febb40fb185:   movq          %r8, 
%rcx           ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          ││ │ │    ││ │                                                        
              ; - java.lang.Math::min@11 (line 2132)
          ││ │ │    ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.MinMaxVector::longClippingRange@32 (line 220)
          ││ │ │    ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19
 (line 124)
   4.26%  ││ │ ↘    ││ │            0x00007febb40fb188:   movq          %rcx, 
0x10(%rsi, %r9, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
          ││ │      ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.MinMaxVector::longClippingRange@35 (line 220)
          ││ │      ││ │                                                        
              ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19
 (line 124)


With the patch these become vectorized:


   0.20%  ││↗        ↗   0x00007f10180fd15c:   vmovdqu          0x10(%r11, 
%rcx, 8), %ymm6
          │││        │   0x00007f10180fd163:   vpmaxsq          %ymm6, %ymm7, 
%ymm6
          │││        │   0x00007f10180fd169:   vpminsq          %ymm8, %ymm6, 
%ymm6
          │││        │   0x00007f10180fd16f:   vmovdqu          %ymm6, 
0x10(%r8, %rcx, 8);*lastore {reexecute=0 rethrow=0 return_oop=0}
          │││        │                                                          
   ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange@35 (line 220)
          │││        │                                                          
   ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19
 (line 124)


# `MinMaxVector` AVX2

Following are results on the same machine as above but forcing AVX2 to be used 
instead of AVX-512:

Benchmark                       (probability)  (range)  (seed)  (size)   Mode  Cnt  Baseline     Patch   Units
MinMaxVector.longClippingRange            N/A       90       0    1000  thrpt    4   832.132  1813.609  ops/ms
MinMaxVector.longClippingRange            N/A      100       0    1000  thrpt    4   832.546  1814.477  ops/ms
MinMaxVector.longLoopMax                   50      N/A     N/A    2048  thrpt    4   938.372   939.313  ops/ms
MinMaxVector.longLoopMax                   80      N/A     N/A    2048  thrpt    4   934.964   945.124  ops/ms
MinMaxVector.longLoopMax                  100      N/A     N/A    2048  thrpt    4   512.076   937.287  ops/ms
MinMaxVector.longLoopMin                   50      N/A     N/A    2048  thrpt    4   999.455   689.750  ops/ms
MinMaxVector.longLoopMin                   80      N/A     N/A    2048  thrpt    4  1000.352   876.326  ops/ms
MinMaxVector.longLoopMin                  100      N/A     N/A    2048  thrpt    4   536.359   999.475  ops/ms
MinMaxVector.longReductionMax              50      N/A     N/A    2048  thrpt    4   409.413   409.363  ops/ms
MinMaxVector.longReductionMax              80      N/A     N/A    2048  thrpt    4   409.374   409.141  ops/ms
MinMaxVector.longReductionMax             100      N/A     N/A    2048  thrpt    4   883.614   409.318  ops/ms
MinMaxVector.longReductionMin              50      N/A     N/A    2048  thrpt    4   404.723   404.705  ops/ms
MinMaxVector.longReductionMin              80      N/A     N/A    2048  thrpt    4   404.755   404.748  ops/ms
MinMaxVector.longReductionMin             100      N/A     N/A    2048  thrpt    4   848.784   404.669  ops/ms


### `longClippingRange` performance improves considerably

Baseline uses compare + move instructions as shown above. But the patched 
version improves in spite of not being able to use AVX-512 instructions such as 
`vpmaxsq`. The performance improvements come from using other vectorized 
compare + vectorized move instructions:


          │    │   ││││  0x00007f9aa40f94ac:   vpcmpgtq         %ymm6, %ymm7, 
%ymm12
   3.79%  │    │   ││││  0x00007f9aa40f94b1:   vblendvpd                %ymm12, 
%ymm7, %ymm6, %ymm12
   3.72%  │    │   ││││  0x00007f9aa40f94b7:   vpcmpgtq         %ymm8, %ymm12, 
%ymm10
          │    │   ││││  0x00007f9aa40f94bc:   vblendvpd                %ymm10, 
%ymm8, %ymm12, %ymm10
   3.78%  │    │   ││││  0x00007f9aa40f94c2:   vmovdqu          %ymm10, 
0xf0(%r8, %rcx, 8)
          │    │   ││││                                                         
   ;*lastore {reexecute=0 rethrow=0 return_oop=0}
          │    │   ││││                                                         
   ; - org.openjdk.bench.java.lang.MinMaxVector::longClippingRange@35 (line 220)
          │    │   ││││                                                         
   ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longClippingRange_jmhTest::longClippingRange_thrpt_jmhStub@19
 (line 124)
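
For readers less familiar with the compare + blend idiom, the following is a scalar sketch of what a single 64-bit lane of the `vpcmpgtq` / `vblendvpd` pair computes. It is only an emulation in plain Java, not what C2 generates internally, and `CompareBlendSketch` with its method names is hypothetical.

```java
// Hypothetical scalar emulation of one 64-bit lane of compare + blend.
final class CompareBlendSketch {

    // All ones when x > y (signed), all zeros otherwise, like a compare lane.
    static long greaterThanMask(long x, long y) {
        return x > y ? -1L : 0L;
    }

    // Picks ifSet where the mask is all ones and ifClear where it is all zeros.
    static long blend(long mask, long ifSet, long ifClear) {
        return (ifSet & mask) | (ifClear & ~mask);
    }

    static long max(long a, long b) {
        return blend(greaterThanMask(a, b), a, b);
    }

    static long min(long a, long b) {
        return blend(greaterThanMask(a, b), b, a);
    }

    // Clipping as a max followed by a min, two compare + blend pairs in total.
    static long clip(long value, long low, long high) {
        return min(max(value, low), high);
    }
}
```

For example, `CompareBlendSketch.clip(5, 10, 20)` yields 10 and `CompareBlendSketch.clip(25, 10, 20)` yields 20, each computed with two compare + blend steps, the same pattern as the two `vpcmpgtq`/`vblendvpd` pairs in the listing above.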


### `longReduction[Min|Max]` performance drops considerably when probability is 100

Baseline uses compare + move instructions to implement this:


          ││││ ││││  │                                                          
           ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││││ ││││  │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@24 
(line 255)
          ││││ ││││  │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
   6.30%  ││││ ││││  │           0x00007fd5580f678b:   cmpq             %rdi, 
%rdx
          ││││╭││││  │           0x00007fd5580f678e:   jge              
0x7fd5580f67ac      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││││││││  │                                                          
           ; - java.lang.Math::max@11 (line 2037)
          │││││││││  │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@30 
(line 256)
          │││││││││  │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
  12.88%  │││││││││↗ │           0x00007fd5580f6790:   imulq            $0xb, 
0x28(%r14, %r8, 8), %rbp
          ││││││││││ │                                                          
           ;*lmul {reexecute=0 rethrow=0 return_oop=0}
          ││││││││││ │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@24 
(line 255)
          ││││││││││ │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
   7.55%  ││││││││││ │           0x00007fd5580f6796:   cmpq             %rbp, 
%rdi
          │││││╰││││ │           0x00007fd5580f6799:   jl               
0x7fd5580f6760      ;*iflt {reexecute=0 rethrow=0 return_oop=0}
          │││││ ││││ │                                                          
           ; - java.lang.Math::max@3 (line 2037)
          │││││ ││││ │                                                          
           ; - org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@30 
(line 256)
          │││││ ││││ │                                                          
           ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)


With the patch the code uses conditional moves instead:


   0.05%  ↗│  0x00007fc4700f5253:   imulq               $0xb, 0x28(%r14, %r11, 
8), %rdx
  10.62%  ││  0x00007fc4700f5259:   imulq               $0xb, 0x20(%r14, %r11, 
8), %rax
   0.63%  ││  0x00007fc4700f525f:   imulq               $0xb, 0x10(%r14, %r11, 
8), %r8
          ││                                                            ;*lmul 
{reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - 
org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@24 (line 255)
          ││                                                            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
  10.34%  ││  0x00007fc4700f5265:   cmpq                %r8, %r13
   2.37%  ││  0x00007fc4700f5268:   cmovlq              %r8, %r13           
;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - 
org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@30 (line 256)
          ││                                                            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
   1.15%  ││  0x00007fc4700f526c:   imulq               $0xb, 0x18(%r14, %r11, 
8), %r8
          ││                                                            ;*lmul 
{reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - 
org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@24 (line 255)
          ││                                                            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)
   9.28%  ││  0x00007fc4700f5272:   cmpq                %r8, %r13
   3.82%  ││  0x00007fc4700f5275:   cmovlq              %r8, %r13
  21.61%  ││  0x00007fc4700f5279:   cmpq                %rax, %r13
  11.55%  ││  0x00007fc4700f527c:   cmovlq              %rax, %r13
   4.48%  ││  0x00007fc4700f5280:   cmpq                %rdx, %r13
  11.76%  ││  0x00007fc4700f5283:   cmovlq              %rdx, %r13          
;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
          ││                                                            ; - 
org.openjdk.bench.java.lang.MinMaxVector::longReductionMax@30 (line 256)
          ││                                                            ; - 
org.openjdk.bench.java.lang.jmh_generated.MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub@19
 (line 124)


When one of the branches is taken always or almost always, the branching code 
in the baseline benefits from branch prediction. The conditional move 
instructions, however, force the CPU to have both inputs ready before the 
result is available, so the patched code performs worse in this scenario.
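
For reference, the reduction under discussion has roughly the following shape. This is a sketch inferred from the `imulq $0xb` and `Math::max` annotations in the disassembly, not the benchmark source; the class name and the initial value of `result` are assumptions.

```java
// Hypothetical sketch of the reduction loop shape.
final class ReductionSketch {

    static long reductionMax(long[] a) {
        long result = Long.MIN_VALUE; // assumed starting value
        for (int i = 0; i < a.length; i++) {
            long v = a[i] * 11; // matches the multiply by 0xb in the listing
            // Every iteration reads the `result` written by the previous one.
            // Compiled as cmp + cmov, that read is a serial data dependency.
            // Compiled as a well-predicted branch, the compare is mostly off
            // the critical path when the outcome rarely changes.
            result = Math.max(result, v);
        }
        return result;
    }
}
```

The Java source is the same in both cases. Which of the two machine-code shapes you get depends on how C2 compiles the `Math.max` call.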

Why are vectorized instructions not used in this scenario? Vector instructions 
for min/max are not available with AVX2, and the vectorization trace signals it:


PackSet::print: 3 packs
 Pack: 0
    0:  1119  LoadL  === 1105 343 1120  [[ 1117 ]]  @long[int:>=0] 
(java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not 
depend only on test, unknown control) !orig=997,663,[457] !jvms: 
MinMaxVector::longReductionMax @ bci:23 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    1:  1112  LoadL  === 1105 343 1113  [[ 1111 ]]  @long[int:>=0] 
(java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not 
depend only on test, unknown control) !orig=663,[457] !jvms: 
MinMaxVector::longReductionMax @ bci:23 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    2:   997  LoadL  === 1105 343 998  [[ 996 ]]  @long[int:>=0] 
(java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not 
depend only on test, unknown control) !orig=663,[457] !jvms: 
MinMaxVector::longReductionMax @ bci:23 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    3:   663  LoadL  === 1105 343 455  [[ 458 ]]  @long[int:>=0] 
(java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not 
depend only on test, unknown control) !orig=[457] !jvms: 
MinMaxVector::longReductionMax @ bci:23 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
 Pack: 1
    0:  1117  MulL  === _ 1119 162  [[ 1116 ]]  !orig=996,458 !jvms: 
MinMaxVector::longReductionMax @ bci:24 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    1:  1111  MulL  === _ 1112 162  [[ 1110 ]]  !orig=458 !jvms: 
MinMaxVector::longReductionMax @ bci:24 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    2:   996  MulL  === _ 997 162  [[ 995 ]]  !orig=458 !jvms: 
MinMaxVector::longReductionMax @ bci:24 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    3:   458  MulL  === _ 663 162  [[ 459 ]]  !jvms: 
MinMaxVector::longReductionMax @ bci:24 (line 255) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
 Pack: 2
    0:  1116  MaxL  === _ 1128 1117  [[ 1110 ]]  !orig=995,459,1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    1:  1110  MaxL  === _ 1116 1111  [[ 995 ]]  !orig=459,1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    2:   995  MaxL  === _ 1110 996  [[ 459 ]]  !orig=459,1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    3:   459  MaxL  === _ 995 458  [[ 1128 923 570 ]]  !orig=1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)

WARNING: Removed pack: not implemented at any smaller size:
    0:  1116  MaxL  === _ 1128 1117  [[ 1110 ]]  !orig=995,459,1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    1:  1110  MaxL  === _ 1116 1111  [[ 995 ]]  !orig=459,1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    2:   995  MaxL  === _ 1110 996  [[ 459 ]]  !orig=459,1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)
    3:   459  MaxL  === _ 995 458  [[ 1128 923 570 ]]  !orig=1012 !jvms: 
MinMaxVector::longReductionMax @ bci:30 (line 256) 
MinMaxVector_longReductionMax_jmhTest::longReductionMax_thrpt_jmhStub @ bci:19 
(line 124)

After SuperWord::split_packs_only_implemented_with_smaller_size


One interesting option to explore here would be whether MaxL/MinL could be 
implemented in terms of vectorized compare instructions, as shown above in the 
`longClippingRange` scenario. Thoughts @rwestrel @eme64?
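
To make the "compare instead of max" idea concrete, here is a minimal scalar sketch of a lane-wise max built from a compare that produces a mask, followed by a blend. This is only an illustration of the general technique, not how C2 implements or would implement it; the class and method names are hypothetical.

```java
// Hypothetical illustration: max expressed as "compare to a mask, then blend".
// Each loop iteration stands in for one vector lane.
final class MaxViaCompareBlend {

    static long[] maxLanes(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            // Compare step: all-ones mask where b[i] is the larger value.
            long mask = (a[i] < b[i]) ? -1L : 0L;
            // Blend step: pick b[i] where the mask is set, a[i] elsewhere.
            out[i] = (a[i] & ~mask) | (b[i] & mask);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] a = {1, -5, Long.MIN_VALUE, 7};
        long[] b = {2, -9, 0, 7};
        System.out.println(java.util.Arrays.toString(maxLanes(a, b)));
    }
}
```

The result matches `Math.max(a[i], b[i])` for every lane; a min variant would only flip the comparison.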

# `VectorReduction2.WithSuperword` on AVX-512 machine

As requested by Emanuel I've also run this benchmark. Note that the results 
here are time per op, so the lower the number the better:


Benchmark                                         (SIZE)  (seed)  Mode  Cnt  Baseline     Patch  Units
VectorReduction2.WithSuperword.longMaxBig           2048       0  avgt    3  3970.527  1918.821  ns/op
VectorReduction2.WithSuperword.longMaxDotProduct    2048       0  avgt    3  1369.634  1055.762  ns/op
VectorReduction2.WithSuperword.longMaxSimple        2048       0  avgt    3   722.314  2172.064  ns/op
VectorReduction2.WithSuperword.longMinBig           2048       0  avgt    3  3996.694  1918.398  ns/op
VectorReduction2.WithSuperword.longMinDotProduct    2048       0  avgt    3  1363.687  1056.375  ns/op
VectorReduction2.WithSuperword.longMinSimple        2048       0  avgt    3   718.150  2179.478  ns/op


`long[Min|Max]Big` and `long[Min|Max]DotProduct` benchmarks show considerable 
improvements (roughly 2x and 1.3x faster respectively), but something odd is 
happening in `long[Min|Max]Simple`, which is about 3x slower with the patch.
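
For reference, a small sketch of the arithmetic behind reading this table. Since the mode is `avgt` (time per op), the speedup is baseline divided by patch, and a value below 1 means the patch is slower. The class name is hypothetical; the numbers are copied from the `WithSuperword` table above.

```java
// Hypothetical helper to turn the avgt table above into speedup factors.
final class SpeedupTable {

    // For time-per-op results, baseline / patch > 1 means the patch is faster.
    static double speedup(double baselineNsPerOp, double patchNsPerOp) {
        return baselineNsPerOp / patchNsPerOp;
    }

    public static void main(String[] args) {
        String[] names = {
            "longMaxBig", "longMaxDotProduct", "longMaxSimple",
            "longMinBig", "longMinDotProduct", "longMinSimple"
        };
        // {baseline, patch} in ns/op, copied from the WithSuperword table.
        double[][] nsPerOp = {
            {3970.527, 1918.821}, {1369.634, 1055.762}, {722.314, 2172.064},
            {3996.694, 1918.398}, {1363.687, 1056.375}, {718.150, 2179.478}
        };
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-18s %.2fx%n", names[i],
                    speedup(nsPerOp[i][0], nsPerOp[i][1]));
        }
    }
}
```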

### `long[Min|Max]Simple` performance drops considerably

Baseline uses compare + move instructions:


   8.05%  ││      ││↗       │    0x00007f9d580f569b:   movq             0x18(%r13, %r11, 8), %r8;*laload {reexecute=0 rethrow=0 return_oop=0}
          ││      │││       │                                                              ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple@22 (line 1054)
          ││      │││       │                                                              ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub@17 (line 190)
   0.23%  ││      │││       │    0x00007f9d580f56a0:   cmpq             %r8, %rsi
          ││╭     │││       │    0x00007f9d580f56a3:   jl               0x7f9d580f5713      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          │││     │││       │                                                              ; - java.lang.Math::max@11 (line 2037)
          │││     │││       │                                                              ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple@28 (line 1055)
          │││     │││       │                                                              ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub@17 (line 190)


Patched version uses conditional moves instead of vectorized instructions:


   2.76%  ││    0x00007fcd180f695c:   movq              0x18(%r14, %r11, 8), %rdi;*laload {reexecute=0 rethrow=0 return_oop=0}
          ││                                                              ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple@22 (line 1054)
          ││                                                              ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub@17 (line 190)
          ││    0x00007fcd180f6961:   cmpq              %rdi, %r13
   3.11%  ││    0x00007fcd180f6964:   cmovlq            %rdi, %r13          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
          ││                                                              ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxSimple@28 (line 1055)
          ││                                                              ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub@17 (line 190)


Why are vectorized instructions not kicking in with the patch? Because superword 
doesn't consider it profitable to vectorize this loop:


PackSet::print: 2 packs
 Pack: 0
    0:  733  LoadL  === 721 184 734  [[ 732 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    1:  728  LoadL  === 721 184 729  [[ 727 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    2:  669  LoadL  === 721 184 670  [[ 668 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    3:  500  LoadL  === 721 184 317  [[ 320 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
 Pack: 1
    0:  732  MaxL  === _ 743 733  [[ 727 ]]  !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    1:  727  MaxL  === _ 732 728  [[ 668 ]]  !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    2:  668  MaxL  === _ 727 669  [[ 320 ]]  !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    3:  320  MaxL  === _ 668 500  [[ 743 593 456 ]]  !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)

WARNING: Removed pack: not profitable:
    0:  732  MaxL  === _ 743 733  [[ 727 ]]  !orig=668,320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    1:  727  MaxL  === _ 732 728  [[ 668 ]]  !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    2:  668  MaxL  === _ 727 669  [[ 320 ]]  !orig=320,685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    3:  320  MaxL  === _ 668 500  [[ 743 593 456 ]]  !orig=685 !jvms: VectorReduction2::longMaxSimple @ bci:28 (line 1055) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)

WARNING: Removed pack: not profitable:
    0:  733  LoadL  === 721 184 734  [[ 732 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=669,500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    1:  728  LoadL  === 721 184 729  [[ 727 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    2:  669  LoadL  === 721 184 670  [[ 668 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=500,[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)
    3:  500  LoadL  === 721 184 317  [[ 320 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=8; #long (does not depend only on test, unknown control) !orig=[319] !jvms: VectorReduction2::longMaxSimple @ bci:22 (line 1054) VectorReduction2_WithSuperword_longMaxSimple_jmhTest::longMaxSimple_avgt_jmhStub @ bci:17 (line 190)

After Superword::filter_packs_for_profitable

PackSet::print: 0 packs

SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize


How can you make it vectorize? By doing some extra work on the value loaded 
from the array (e.g. a multiplication) before passing it to min/max. That is 
what the `MinMaxVector.longReduction[Min|Max]` and 
`VectorReduction2.long[Min|Max]DotProduct` methods do.
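
The difference between the two loop shapes can be sketched as follows. This is an assumed reconstruction based on the bytecode positions visible in the listings (a load feeding `max` directly versus a load, a multiply and then `max`); it is not the benchmark source, and the class and method names are hypothetical.

```java
// Hypothetical reduction shapes; not copied from VectorReduction2.
final class ReductionShapes {

    // "Simple" shape: the loaded value feeds Math.max directly, so the loop
    // body is just a LoadL plus a MaxL.
    static long maxSimple(long[] in) {
        long acc = Long.MIN_VALUE;
        for (int i = 0; i < in.length; i++) {
            acc = Math.max(acc, in[i]);
        }
        return acc;
    }

    // "DotProduct" shape: extra element-wise work (a multiply) happens
    // before the value reaches Math.max.
    static long maxDotProduct(long[] a, long[] b) {
        long acc = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            long val = a[i] * b[i];
            acc = Math.max(acc, val);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(maxSimple(new long[]{3, -1, 7, 2}));
        System.out.println(maxDotProduct(new long[]{1, 2, 3}, new long[]{4, -5, 6}));
    }
}
```

In the second shape the loads and the multiply give the vectorizer more element-wise work per iteration, which is consistent with the packs above being judged profitable only when such work is present.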

# `VectorReduction2.NoSuperword` on AVX-512 machine


Benchmark                                       (SIZE)  (seed)  Mode  Cnt  Baseline     Patch  Units
VectorReduction2.NoSuperword.longMaxBig           2048       0  avgt    3  3964.403  2966.258  ns/op
VectorReduction2.NoSuperword.longMaxDotProduct    2048       0  avgt    3  1686.373  2462.876  ns/op
VectorReduction2.NoSuperword.longMaxSimple        2048       0  avgt    3   722.219  2171.859  ns/op
VectorReduction2.NoSuperword.longMinBig           2048       0  avgt    3  3994.685  2971.143  ns/op
VectorReduction2.NoSuperword.longMinDotProduct    2048       0  avgt    3  1366.291  2428.173  ns/op
VectorReduction2.NoSuperword.longMinSimple        2048       0  avgt    3   719.218  2179.546  ns/op


Performance improves for `long[Min|Max]Big`. `long[Min|Max]Simple` suffers 
from the same issue shown in the previous section: when not vectorized, these 
benchmarks fall back on conditional moves. The drop in performance in 
`long[Min|Max]DotProduct` needs some explanation.

### `long[Min|Max]DotProduct` performance drops considerably

Baseline uses compare + move instructions here:


   5.67%  │││ │││↗  │    0x00007f3fcc0fa71d:   movq             0x20(%r14, %r8, 8), %r9
   5.19%  │││ ││││  │    0x00007f3fcc0fa722:   imulq            0x20(%rax, %r8, 8), %r9;*lmul {reexecute=0 rethrow=0 return_oop=0}
          │││ ││││  │                                                              ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct@30 (line 1125)
          │││ ││││  │                                                              ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub@17 (line 190)
   8.46%  │││ ││││  │    0x00007f3fcc0fa728:   cmpq             %r9, %rsi
          │││╭││││  │    0x00007f3fcc0fa72b:   jl               0x7f3fcc0fa751      ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
          ││││││││  │                                                              ; - java.lang.Math::max@11 (line 2037)
          ││││││││  │                                                              ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct@36 (line 1126)
          ││││││││  │                                                              ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub@17 (line 190)


Patch transforms this into conditional moves:


  11.00%  │  0x00007f66f40f70b2:   movq                 0x18(%r13, %rcx, 8), %rax
          │  0x00007f66f40f70b7:   imulq                0x18(%r9, %rcx, 8), %rax;*lmul {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct@30 (line 1125)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub@17 (line 190)
          │  0x00007f66f40f70bd:   cmpq                 %rdx, %rax
  13.07%  │  0x00007f66f40f70c0:   cmovlq               %rdx, %rax          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorReduction2::longMaxDotProduct@36 (line 1126)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorReduction2_NoSuperword_longMaxDotProduct_jmhTest::longMaxDotProduct_avgt_jmhStub@17 (line 190)


This is similar to what we have seen above. Without superword, the fallback 
for MaxL/MinL is a conditional move. Although branch probabilities are not 
controlled here, we can observe that one of the branches is likely being taken 
~100% of the time, in which case a well-predicted compare + branch in the 
baseline would be expected to beat the conditional move.
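
One way to see why a max reduction can end up with such a lopsided branch is to count how often the running maximum actually changes. The sketch below assumes the input holds independent random values, which is an assumption here since the benchmark's data setup is not shown; the class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical experiment: how often does a running max get updated?
final class RunningMaxUpdates {

    // Counts the iterations in which the "new maximum" path is taken.
    static int countUpdates(long[] in) {
        long acc = Long.MIN_VALUE;
        int updates = 0;
        for (long v : in) {
            if (v > acc) {
                acc = v;
                updates++;
            }
        }
        return updates;
    }

    // Fills an array with pseudo-random longs (assumed data distribution).
    static long[] randomLongs(int size, long seed) {
        Random random = new Random(seed);
        long[] out = new long[size];
        for (int i = 0; i < size; i++) {
            out[i] = random.nextLong();
        }
        return out;
    }

    public static void main(String[] args) {
        long[] data = randomLongs(2048, 0L);
        System.out.println("updates: " + countUpdates(data) + " of " + data.length);
    }
}
```

Under that assumption, the i-th element is a new maximum with probability about 1/i, so only a handful of the 2048 iterations (on the order of ln(2048) + 0.58, roughly 8) take the update path and the other path is taken almost always.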

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2642788364
