Re: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Galder Zamarreño Mon, 17 Feb 2025 08:50:09 -0800

On Mon, 17 Feb 2025 15:02:32 GMT, Roland Westrelin <rol...@openjdk.org> wrote:


>> @rwestrel @galderz 
>> 
>>> It seems overall, we likely win more than we loose with this intrinsic, so 
>>> I would integrate this change as it is and file a bug to keep track of 
>>> remaining issues.
>> 
>> I'm a little scared to just accept the regressions, especially for this 
>> "most average looking case":
>> Imagine you have an array with random numbers. Or at least numbers in a 
>> random order. If we take the max, then we expect the first number to be max 
>> with probability 1, the second 1/2, the third 1/3, the i'th 1/i. So the 
>> average branch probability is `n / (sum_i 1/i)`. This goes closer and closer 
>> to zero, the larger the array. This means that the "average" case has an 
>> extreme probability. And so if we do not vectorize, then this gets us a 
>> regression with the current patch. And vectorization is a little fragile, it 
>> only takes very little for vectorization not to kick in.
>> 
>>> The Min/Max nodes are floating nodes. They can hoist out of loop and common 
>>> reliably in ways that are not guaranteed otherwise.
>> 
>> I suppose we could write an optimization that can hoist out loop independent 
>> if-diamonds out of a loop. If the condition and all phi inputs are loop 
>> invariant, you could just cut the diamond out of the loop, and paste it 
>> before the loop entry.
>> 
>>> Shouldn't int min/max be affected the same way?
>> 
>> I think we should be able to see the same issue here, actually. Yes. Here a 
>> quick benchmark below:
>> 
>> 
>> java -XX:CompileCommand=compileonly,TestIntMax::test* 
>> -XX:CompileCommand=printcompilation,TestIntMax::test* -XX:+TraceNewVectors 
>> TestIntMax.java
>> CompileCommand: compileonly TestIntMax.test* bool compileonly = true
>> CompileCommand: PrintCompilation TestIntMax.test* bool PrintCompilation = 
>> true
>> Warmup
>> 5225   93 %     3       TestIntMax::test1 @ 5 (27 bytes)
>> 5226   94       3       TestIntMax::test1 (27 bytes)
>> 5226   95 %     4       TestIntMax::test1 @ 5 (27 bytes)
>> 5238   96       4       TestIntMax::test1 (27 bytes)
>> Run
>> Time: 542056319
>> Warmup
>> 6320  101 %     3       TestIntMax::test2 @ 5 (34 bytes)
>> 6322  102 %     4       TestIntMax::test2 @ 5 (34 bytes)
>> 6329  103       4       TestIntMax::test2 (34 bytes)
>> Run
>> Time: 166815209
>> 
>> That's a 4x regression on random input data!
>> 
>> With:
>> 
>> import java.util.Random;
>> 
>> public class TestIntMax {
>>     private static Random RANDOM = new Random();
>> 
>>     public static void main(String[] args) {
>>         int[] a = new int[64 * 1024];
>>         for (int i = 0; i < a.length; i++) {
>>...
>
>> I think we should be able to see the same issue here, actually. Yes. Here a 
>> quick benchmark below:
> 
> I observe the same:
> 
> 
> Warmup
> 751    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 360 550 158
> Warmup
> 1862   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 92 116 170
> 
> 
> But then with this:
> 
> 
> diff --git a/src/hotspot/cpu/x86/x86_64.ad b/src/hotspot/cpu/x86/x86_64.ad
> index 8cc4a970bfd..9abda8f4178 100644
> --- a/src/hotspot/cpu/x86/x86_64.ad
> +++ b/src/hotspot/cpu/x86/x86_64.ad
> @@ -12037,16 +12037,20 @@ instruct cmovI_reg_l(rRegI dst, rRegI src, 
> rFlagsReg cr)
>  %}
>  
>  
> -instruct maxI_rReg(rRegI dst, rRegI src)
> +instruct maxI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
>  %{
>    match(Set dst (MaxI dst src));
> +  effect(KILL cr);
>  
>    ins_cost(200);
> -  expand %{
> -    rFlagsReg cr;
> -    compI_rReg(cr, dst, src);
> -    cmovI_reg_l(dst, src, cr);
> +  ins_encode %{
> +    Label done;
> +    __ cmpl($src$$Register, $dst$$Register);
> +    __ jccb(Assembler::less, done);
> +    __ mov($dst$$Register, $src$$Register);
> +    __ bind(done);
>    %}
> +  ins_pipe(pipe_cmov_reg);
>  %}
>  
>  // 
> ============================================================================
> 
> 
> the performance gap narrows:
> 
> 
> Warmup
> 770    3    b        TestIntMax::test1 (27 bytes)
> Run
> Time: 94 951 677
> Warmup
> 1312   15    b        TestIntMax::test2 (34 bytes)
> Run
> Time: 70 053 824
> 
> 
> (the number of test2 fluctuates quite a bit). Does it ever make sense to 
> implement `MaxI` with a conditional move then?

@rwestrel @eme64 I think that the data distribution in the `TestIntMax` above 
matters (see my explanations in 
https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364), so I've 
enhanced the test to control data distribution in the int[] (see at the bottom).

Here are the results I see on my AVX-512 machine:


Probability: 50%
Warmup
7834   92 %  b  3       TestIntMax::test1 @ 5 (27 bytes)
7836   93    b  3       TestIntMax::test1 (27 bytes)
7838   94 %  b  4       TestIntMax::test1 @ 5 (27 bytes)
7851   95    b  4       TestIntMax::test1 (27 bytes)
Run
Time: 699 923 014
Warmup
9272   96 %  b  3       TestIntMax::test2 @ 5 (34 bytes)
9274   97    b  3       TestIntMax::test2 (34 bytes)
9275   98 %  b  4       TestIntMax::test2 @ 5 (34 bytes)
9287   99    b  4       TestIntMax::test2 (34 bytes)
Run
Time: 699 815 792

Probability: 80%
Warmup
7872   92 %  b  3       TestIntMax::test1 @ 5 (27 bytes)
7874   93    b  3       TestIntMax::test1 (27 bytes)
7875   94 %  b  4       TestIntMax::test1 @ 5 (27 bytes)
7889   95    b  4       TestIntMax::test1 (27 bytes)
Run
Time: 699 947 633
Warmup
9310   96 %  b  3       TestIntMax::test2 @ 5 (34 bytes)
9311   97    b  3       TestIntMax::test2 (34 bytes)
9312   98 %  b  4       TestIntMax::test2 @ 5 (34 bytes)
9325   99    b  4       TestIntMax::test2 (34 bytes)
Run
Time: 699 827 882

Probability: 100%
Warmup
7884   92 %  b  3       TestIntMax::test1 @ 5 (27 bytes)
7886   93    b  3       TestIntMax::test1 (27 bytes)
7888   94 %  b  4       TestIntMax::test1 @ 5 (27 bytes)
7901   95    b  4       TestIntMax::test1 (27 bytes)
Run
Time: 699 931 243
Warmup
9322   96 %  b  3       TestIntMax::test2 @ 5 (34 bytes)
9323   97    b  3       TestIntMax::test2 (34 bytes)
9324   98 %  b  4       TestIntMax::test2 @ 5 (34 bytes)
9336   99    b  4       TestIntMax::test2 (34 bytes)
Run
Time: 1 077 937 282



import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;

class TestIntMax
{
    static final int RANGE = 16 * 1024;
    static final int ITER = 100_000;

    public static void main(String[] args)
    {
        final int probability = Integer.parseInt(args[0]);

        final DecimalFormatSymbols symbols = new DecimalFormatSymbols();
        symbols.setGroupingSeparator(' ');
        final DecimalFormat format = new DecimalFormat("#,###", symbols);

        System.out.printf("Probability: %d%%%n", probability);
        int[] a = new int[64 * 1024];
        init(a, probability);

        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++)
            {
                test1(a);
            }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++)
            {
                test1(a);
            }
            long t1 = System.nanoTime();
            System.out.println("Time: " + format.format(t1 - t0));
        }

        {
            System.out.println("Warmup");
            for (int i = 0; i < 10_000; i++)
            {
                test2(a);
            }
            System.out.println("Run");
            long t0 = System.nanoTime();
            for (int i = 0; i < 10_000; i++)
            {
                test2(a);
            }
            long t1 = System.nanoTime();
            System.out.println("Time: " + format.format(t1 - t0));
        }
    }

    public static int test1(int[] a)
    {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++)
        {
            x = Math.max(x, a[i]);
        }
        return x;
    }

    public static int test2(int[] a)
    {
        int x = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++)
        {
            x = (x >= a[i]) ? x : a[i];
        }
        return x;
    }

    public static void init(int[] ints, int probability)
    {
        int aboveCount, abovePercent;

        do
        {
            int max = ThreadLocalRandom.current().nextInt(10);
            ints[0] = max;

            aboveCount = 0;
            for (int i = 1; i < ints.length; i++)
            {
                int value;
                if (ThreadLocalRandom.current().nextInt(101) <= probability)
                {
                    int increment = ThreadLocalRandom.current().nextInt(10);
                    value = max + increment;
                    aboveCount++;
                }
                else
                {
                    // Decrement by at least 1
                    int decrement = ThreadLocalRandom.current().nextInt(10) + 1;
                    value = max - decrement;
                }
                ints[i] = value;
                max = Math.max(max, value);
            }

            abovePercent = ((aboveCount + 1) * 100) / ints.length;
        } while (abovePercent != probability);
    }
}


Focusing my comment below on 100% which is where the differences appear:

test2 (100%):

 ;; B12: #      out( B21 B13 ) <- in( B11 B20 )  Freq: 1.6744e+09
  0x00007f15bcada2e9:   movl            0x14(%rsi, %rdx, 4), %r11d
                                                            ;*iaload 
{reexecute=0 rethrow=0 return_oop=0}
                                                            ; - 
TestIntMax::test2@14 (line 71)
  0x00007f15bcada2ee:   cmpl            %r11d, %r10d
  0x00007f15bcada2f1:   jge             0x7f15bcada362      ;*istore_1 
{reexecute=0 rethrow=0 return_oop=0}
                                                            ; - 
TestIntMax::test2@25 (line 71)


test1 (100%)

 ;; B10: #      out( B10 B11 ) <- in( B9 B10 ) Loop( B10-B10 inner main of N64 
strip mined) Freq: 1.6744e+09
  0x00007f15bcad9a70:   movl            0x4c(%rsi, %rdx, 4), %r11d
  0x00007f15bcad9a75:   movl            %r11d, (%rsp)
  0x00007f15bcad9a79:   movl            0x48(%rsi, %rdx, 4), %r10d
  0x00007f15bcad9a7e:   movl            %r10d, 4(%rsp)
  0x00007f15bcad9a83:   movl            0x10(%rsi, %rdx, 4), %r11d
  0x00007f15bcad9a88:   movl            0x14(%rsi, %rdx, 4), %r9d
  0x00007f15bcad9a8d:   movl            0x44(%rsi, %rdx, 4), %r10d
  0x00007f15bcad9a92:   movl            %r10d, 8(%rsp)
  0x00007f15bcad9a97:   movl            0x18(%rsi, %rdx, 4), %r8d
  0x00007f15bcad9a9c:   cmpl            %r11d, %eax
  0x00007f15bcad9a9f:   cmovll          %r11d, %eax
  0x00007f15bcad9aa3:   cmpl            %r9d, %eax
  0x00007f15bcad9aa6:   cmovll          %r9d, %eax
  0x00007f15bcad9aaa:   movl            0x20(%rsi, %rdx, 4), %r10d
  0x00007f15bcad9aaf:   cmpl            %r8d, %eax
  0x00007f15bcad9ab2:   cmovll          %r8d, %eax
  0x00007f15bcad9ab6:   movl            0x24(%rsi, %rdx, 4), %r8d
  0x00007f15bcad9abb:   movl            0x28(%rsi, %rdx, 4), %r11d
                                                            ;   {no_reloc}
  0x00007f15bcad9ac0:   movl            0x2c(%rsi, %rdx, 4), %ecx
  0x00007f15bcad9ac4:   movl            0x30(%rsi, %rdx, 4), %r9d
  0x00007f15bcad9ac9:   movl            0x34(%rsi, %rdx, 4), %edi
  0x00007f15bcad9acd:   movl            0x38(%rsi, %rdx, 4), %ebx
  0x00007f15bcad9ad1:   movl            0x3c(%rsi, %rdx, 4), %ebp
  0x00007f15bcad9ad5:   movl            0x40(%rsi, %rdx, 4), %r13d
  0x00007f15bcad9ada:   movl            0x1c(%rsi, %rdx, 4), %r14d
  0x00007f15bcad9adf:   cmpl            %r14d, %eax
  0x00007f15bcad9ae2:   cmovll          %r14d, %eax
  0x00007f15bcad9ae6:   cmpl            %r10d, %eax
  0x00007f15bcad9ae9:   cmovll          %r10d, %eax
  0x00007f15bcad9aed:   cmpl            %r8d, %eax
  0x00007f15bcad9af0:   cmovll          %r8d, %eax
  0x00007f15bcad9af4:   cmpl            %r11d, %eax
  0x00007f15bcad9af7:   cmovll          %r11d, %eax
  0x00007f15bcad9afb:   cmpl            %ecx, %eax
  0x00007f15bcad9afd:   cmovll          %ecx, %eax
  0x00007f15bcad9b00:   cmpl            %r9d, %eax
  0x00007f15bcad9b03:   cmovll          %r9d, %eax
  0x00007f15bcad9b07:   cmpl            %edi, %eax
  0x00007f15bcad9b09:   cmovll          %edi, %eax
  0x00007f15bcad9b0c:   cmpl            %ebx, %eax
  0x00007f15bcad9b0e:   cmovll          %ebx, %eax
  0x00007f15bcad9b11:   cmpl            %ebp, %eax
  0x00007f15bcad9b13:   cmovll          %ebp, %eax
  0x00007f15bcad9b16:   cmpl            %r13d, %eax
  0x00007f15bcad9b19:   cmovll          %r13d, %eax
  0x00007f15bcad9b1d:   cmpl            8(%rsp), %eax
  0x00007f15bcad9b21:   movl            8(%rsp), %r11d
  0x00007f15bcad9b26:   cmovll          %r11d, %eax
  0x00007f15bcad9b2a:   cmpl            4(%rsp), %eax
  0x00007f15bcad9b2e:   movl            4(%rsp), %r10d
  0x00007f15bcad9b33:   cmovll          %r10d, %eax
  0x00007f15bcad9b37:   cmpl            (%rsp), %eax
  0x00007f15bcad9b3a:   movl            (%rsp), %r11d
  0x00007f15bcad9b3e:   cmovll          %r11d, %eax         ;*invokestatic max 
{reexecute=0 rethrow=0 return_oop=0}
                                                            ; - 
TestIntMax::test1@15 (line 61)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2663633050

Re: RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v11]

Reply via email to