On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <rol...@openjdk.org> wrote:
>> @rwestrel @galderz >> >>> It seems overall, we likely win more than we loose with this intrinsic, so >>> I would integrate this change as it is and file a bug to keep track of >>> remaining issues. >> >> I'm a little scared to just accept the regressions, especially for this >> "most average looking case": >> Imagine you have an array with random numbers. Or at least numbers in a >> random order. If we take the max, then we expect the first number to be max >> with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the >> average branch probability is `n / (sum_i 1/i)`. This goes closer and closer >> to zero, the larger the array. This means that the "average" case has an >> extreme probability. And so if we do not vectorize, then this gets us a >> regression with the current patch. And vectorization is a little fragile, it >> only takes very little for vectorization not to kick in. >> >>> The Min/Max nodes are floating nodes. They can hoist out of loop and common >>> reliably in ways that are not guaranteed otherwise. >> >> I suppose we could write an optimization that can hoist out loop independent >> if-diamonds out of a loop. If the condition and all phi inputs are loop >> invariant, you could just cut the diamond out of the loop, and paste it >> before the loop entry. >> >>> Shouldn't int min/max be affected the same way? >> >> I think we should be able to see the same issue here, actually. Yes. Here a >> quick benchmark below: >> >> >> java -XX:CompileCommand=compileonly,TestIntMax::test* >> -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors >> TestIntMax.java >> CompileCommand: compileonly TestIntMax.test* bool compileonly = true >> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = >> true >> Warmup >> 5225 93 % 3 TestIntMax::test1 @ 5 (27 bytes) >> 5226 94 3 TestIntMax::test1 (27 bytes) >> 5226 95 % 4 TestIntMax::test1 @ 5 (27 bytes) >> 5238 96 4 TestIntMax::test1 (27 bytes) >> Run >> Time: 542056319 >> Warmup >> 6320 101 % 3 TestIntMax::test2 @ 5 (34 bytes) >> 6322 102 % 4 TestIntMax::test2 @ 5 (34 bytes) >> 6329 103 4 TestIntMax::test2 (34 bytes) >> Run >> Time: 166815209 >> >> That's a 4x regression on random input data! >> >> With: >> >> import java.util.Random; >> >> public class TestIntMax { >> private static Random RANDOM = new Random(); >> >> public static void main(String[] args) { >> int[] a = new int[64 * 1024]; >> for (int i = 0; i < a.length; i++) { >>... > >> I think we should be able to see the same issue here, actually. Yes. Here a >> quick benchmark below: > > I observe the same: > > > Warmup > 751 3 b TestIntMax::test1 (27 bytes) > Run > Time: 360 550 158 > Warmup > 1862 15 b TestIntMax::test2 (34 bytes) > Run > Time: 92 116 170 > > > But then with this: > > > diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad > index 8cc4a970bfd..9abda8f4178 100644 > --- a/src/hotspot/cpu/x86/x86_64.ad > +++ b/src/hotspot/cpu/x86/x86_64.ad > @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, > rFlagsReg cr) > %} > > > -instruct maxI_rReg(rRegI dst, rRegI src) > +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr) > %{ > match(Set dst (MaxI dst src)); > + effect(KILL cr); > > ins_cost(200); > - expand %{ > - rFlagsReg cr; > - compI_rReg(cr, dst, src); > - cmovI_reg_l(dst, src, cr); > + ins_encode %{ > + Label done; > + __ cmpl($src$$Register, $dst$$Register); > + __ jccb(Assembler::less, done); > + __ mov($dst$$Register, $src$$Register); > + __ bind(done); > %} > + ins_pipe(pipe_cmov_reg); > %} > > // > ============================================================================ > > > the performance gap narrows: > > > Warmup > 770 3 b TestIntMax::test1 (27 bytes) > Run > Time: 94 951 677 > Warmup > 1312 15 b TestIntMax::test2 (34 bytes) > Run > Time: 70 053 824 > > > (the number of test2 fluctuates quite a bit). Does it ever make sense to > implement `MaxI` with a conditional move then? @rwestrel @eme64 I think that the data distribution in the `TestIntMax` above matters (see my explanations in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364), so I've enhanced the test to control data distribution in the int[] (see at the bottom). Here are the results I see on my AVX-512 machine: Probability: 50% Warmup 7834 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7836 93 b 3 TestIntMax::test1 (27 bytes) 7838 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7851 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 923 014 Warmup 9272 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9274 97 b 3 TestIntMax::test2 (34 bytes) 9275 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9287 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 699 815 792 Probability: 80% Warmup 7872 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7874 93 b 3 TestIntMax::test1 (27 bytes) 7875 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7889 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 947 633 Warmup 9310 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9311 97 b 3 TestIntMax::test2 (34 bytes) 9312 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9325 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 699 827 882 Probability: 100% Warmup 7884 92 % b 3 TestIntMax::test1 @ 5 (27 bytes) 7886 93 b 3 TestIntMax::test1 (27 bytes) 7888 94 % b 4 TestIntMax::test1 @ 5 (27 bytes) 7901 95 b 4 TestIntMax::test1 (27 bytes) Run Time: 699 931 243 Warmup 9322 96 % b 3 TestIntMax::test2 @ 5 (34 bytes) 9323 97 b 3 TestIntMax::test2 (34 bytes) 9324 98 % b 4 TestIntMax::test2 @ 5 (34 bytes) 9336 99 b 4 TestIntMax::test2 (34 bytes) Run Time: 1 077 937 282 import java.util.Random; import java.util.concurrent.ThreadLocalRandom; import java.text.DecimalFormat; import java.text.DecimalFormatSymbols; class TestIntMax { static final int RANGE = 16 * 1024; static final int ITER = 100_000; public static void main(String[] args) { final int probability = Integer.parseInt(args[0]); final DecimalFormatSymbols symbols = new DecimalFormatSymbols(); symbols.setGroupingSeparator(' '); final DecimalFormat format = new DecimalFormat("#,###", symbols); System.out.printf("Probability: %d%%%n", probability); int[] a = new int[64 * 1024]; init(a, probability); { System.out.println("Warmup"); for (int i = 0; i < 10_000; i++) { test1(a); } System.out.println("Run"); long t0 = System.nanoTime(); for (int i = 0; i < 10_000; i++) { test1(a); } long t1 = System.nanoTime(); System.out.println("Time: " + format.format(t1 - t0)); } { System.out.println("Warmup"); for (int i = 0; i < 10_000; i++) { test2(a); } System.out.println("Run"); long t0 = System.nanoTime(); for (int i = 0; i < 10_000; i++) { test2(a); } long t1 = System.nanoTime(); System.out.println("Time: " + format.format(t1 - t0)); } } public static int test1(int[] a) { int x = Integer.MIN_VALUE; for (int i = 0; i < a.length; i++) { x = Math.max(x, a[i]); } return x; } public static int test2(int[] a) { int x = Integer.MIN_VALUE; for (int i = 0; i < a.length; i++) { x = (x >= a[i]) ? x : a[i]; } return x; } public static void init(int[] ints, int probability) { int aboveCount, abovePercent; do { int max = ThreadLocalRandom.current().nextInt(10); ints[0] = max; aboveCount = 0; for (int i = 1; i < ints.length; i++) { int value; if (ThreadLocalRandom.current().nextInt(101) <= probability) { int increment = ThreadLocalRandom.current().nextInt(10); value = max + increment; aboveCount++; } else { // Decrement by at least 1 int decrement = ThreadLocalRandom.current().nextInt(10) + 1; value = max - decrement; } ints[i] = value; max = Math.max(max, value); } abovePercent = ((aboveCount + 1) * 100) / ints.length; } while (abovePercent != probability); } } Focusing my comment below on 100% which is where the differences appear: test2 (100%): ;; B12: # out( B21 B13 ) <- in( B11 B20 ) Freq: 1.6744e+09 0x00007f15bcada2e9: movl 0x14(%rsi, %rdx, 4), %r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2@14 (line 71) 0x00007f15bcada2ee: cmpl %r11d, %r10d 0x00007f15bcada2f1: jge 0x7f15bcada362 ;*istore_1 {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test2@25 (line 71) test1 (100%) ;; B10: # out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 strip mined) Freq: 1.6744e+09 0x00007f15bcad9a70: movl 0x4c(%rsi, %rdx, 4), %r11d 0x00007f15bcad9a75: movl %r11d, (%rsp) 0x00007f15bcad9a79: movl 0x48(%rsi, %rdx, 4), %r10d 0x00007f15bcad9a7e: movl %r10d, 4(%rsp) 0x00007f15bcad9a83: movl 0x10(%rsi, %rdx, 4), %r11d 0x00007f15bcad9a88: movl 0x14(%rsi, %rdx, 4), %r9d 0x00007f15bcad9a8d: movl 0x44(%rsi, %rdx, 4), %r10d 0x00007f15bcad9a92: movl %r10d, 8(%rsp) 0x00007f15bcad9a97: movl 0x18(%rsi, %rdx, 4), %r8d 0x00007f15bcad9a9c: cmpl %r11d, %eax 0x00007f15bcad9a9f: cmovll %r11d, %eax 0x00007f15bcad9aa3: cmpl %r9d, %eax 0x00007f15bcad9aa6: cmovll %r9d, %eax 0x00007f15bcad9aaa: movl 0x20(%rsi, %rdx, 4), %r10d 0x00007f15bcad9aaf: cmpl %r8d, %eax 0x00007f15bcad9ab2: cmovll %r8d, %eax 0x00007f15bcad9ab6: movl 0x24(%rsi, %rdx, 4), %r8d 0x00007f15bcad9abb: movl 0x28(%rsi, %rdx, 4), %r11d ; {no_reloc} 0x00007f15bcad9ac0: movl 0x2c(%rsi, %rdx, 4), %ecx 0x00007f15bcad9ac4: movl 0x30(%rsi, %rdx, 4), %r9d 0x00007f15bcad9ac9: movl 0x34(%rsi, %rdx, 4), %edi 0x00007f15bcad9acd: movl 0x38(%rsi, %rdx, 4), %ebx 0x00007f15bcad9ad1: movl 0x3c(%rsi, %rdx, 4), %ebp 0x00007f15bcad9ad5: movl 0x40(%rsi, %rdx, 4), %r13d 0x00007f15bcad9ada: movl 0x1c(%rsi, %rdx, 4), %r14d 0x00007f15bcad9adf: cmpl %r14d, %eax 0x00007f15bcad9ae2: cmovll %r14d, %eax 0x00007f15bcad9ae6: cmpl %r10d, %eax 0x00007f15bcad9ae9: cmovll %r10d, %eax 0x00007f15bcad9aed: cmpl %r8d, %eax 0x00007f15bcad9af0: cmovll %r8d, %eax 0x00007f15bcad9af4: cmpl %r11d, %eax 0x00007f15bcad9af7: cmovll %r11d, %eax 0x00007f15bcad9afb: cmpl %ecx, %eax 0x00007f15bcad9afd: cmovll %ecx, %eax 0x00007f15bcad9b00: cmpl %r9d, %eax 0x00007f15bcad9b03: cmovll %r9d, %eax 0x00007f15bcad9b07: cmpl %edi, %eax 0x00007f15bcad9b09: cmovll %edi, %eax 0x00007f15bcad9b0c: cmpl %ebx, %eax 0x00007f15bcad9b0e: cmovll %ebx, %eax 0x00007f15bcad9b11: cmpl %ebp, %eax 0x00007f15bcad9b13: cmovll %ebp, %eax 0x00007f15bcad9b16: cmpl %r13d, %eax 0x00007f15bcad9b19: cmovll %r13d, %eax 0x00007f15bcad9b1d: cmpl 8(%rsp), %eax 0x00007f15bcad9b21: movl 8(%rsp), %r11d 0x00007f15bcad9b26: cmovll %r11d, %eax 0x00007f15bcad9b2a: cmpl 4(%rsp), %eax 0x00007f15bcad9b2e: movl 4(%rsp), %r10d 0x00007f15bcad9b33: cmovll %r10d, %eax 0x00007f15bcad9b37: cmpl (%rsp), %eax 0x00007f15bcad9b3a: movl (%rsp), %r11d 0x00007f15bcad9b3e: cmovll %r11d, %eax ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0} ; - TestIntMax::test1@15 (line 61) ------------- PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663633050