Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction [v2]

Jasmine Karthikeyan Mon, 14 Oct 2024 14:53:49 -0700

On Wed, 9 Oct 2024 09:59:11 GMT, Jatin Bhateja <[email protected]> wrote:


>> This patch optimizes LongVector multiplication by inferring VPMULUDQ 
>> instruction for following IR pallets.
>>   
>> 
>>        MulL   ( And  SRC1,  0xFFFFFFFF)   ( And  SRC2,  0xFFFFFFFF) 
>>        MulL   (URShift SRC1 , 32) (URShift SRC2, 32)
>>        MulL   (URShift SRC1 , 32)  ( And  SRC2,  0xFFFFFFFF)
>>        MulL   ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
>> 
>> 
>> 
>>  A  64x64 bit multiplication produces 128 bit result, and can be performed 
>> by individually multiplying upper and lower double word of multiplier with 
>> multiplicand and assembling the partial products to compute full width 
>> result. Targets supporting vector quadword multiplication have separate 
>> instructions to compute upper and lower quadwords for 128 bit result. 
>> Therefore existing VectorAPI multiplication operator expects shape 
>> conformance between source and result vectors.
>> 
>> If upper 32 bits of quadword multiplier and multiplicand is always set to 
>> zero then result of multiplication is only dependent on the partial product 
>> of their lower double words and can be performed using unsigned 32 bit 
>> multiplication instruction with quadword saturation. Patch matches this 
>> pattern in a target dependent manner without introducing new IR node.
>>  
>> VPMULUDQ instruction performs unsigned multiplication between even numbered 
>> doubleword lanes of two long vectors and produces 64 bit result.  It has 
>> much lower latency compared to full 64 bit multiplication instruction 
>> "VPMULLQ", in addition non-AVX512DQ targets does not support direct quadword 
>> multiplication, thus we can save redundant partial product for zeroed out 
>> upper 32 bits. This results into throughput improvements on both P and E 
>> core Xeons.
>> 
>> Please find below the performance of [XXH3 hashing benchmark 
>> ](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)included
>>  with the patch:-
>>  
>> 
>> Sierra Forest :-
>> ============
>> Baseline:-
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   
>> Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228        
>>   ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044        
>>   ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641        
>>   ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664        
>>   ops/ms
>> 
>> With Optimization:-
>> Benchmark                                 (SIZE)   Mode  Cnt     Score   
>> Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel   ...
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains two commits:
> 
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

> Hi @jaskarth , Bigger pattern matching is sensitive to [IR level node 
> sharing](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L1724),
>  thus it may not full proof for above 4 patterns. Current patch takes care of 
> this limitation.

I think this is a good point. I've taken a look at the patch and added some 
comments below.

src/hotspot/cpu/x86/matcher_x86.hpp line 184:

> 182:   // Does the CPU supports doubleword multiplication with quadword 
> saturation.
> 183:   static constexpr bool 
> supports_double_word_mult_with_quadword_staturation(void) {
> 184:     return true;

Should this be `UseAVX > 0`? I'm wondering since we have a `MulVL` rule that 
applies when `UseAVX == 0`.

src/hotspot/share/opto/vectornode.cpp line 2089:

> 2087:   if (Matcher::supports_double_word_mult_with_quadword_staturation() &&
> 2088:       !is_mult_lower_double_word()) {
> 2089:     auto is_clear_upper_double_word_uright_shift_op = [](const Node *n) 
> {

Suggestion:

    auto is_clear_upper_double_word_uright_shift_op = [](const Node* n) {

src/hotspot/share/opto/vectornode.cpp line 2093:

> 2091:              n->in(2)->Opcode() == Op_RShiftCntV && 
> n->in(2)->in(1)->is_Con() &&
> 2092:              n->in(2)->in(1)->bottom_type()->isa_int() &&
> 2093:              n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32L;

Suggestion:

             n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32;

Since you are comparing with a `TypeInt` I think this shouldn't be `32L`.

src/hotspot/share/opto/vectornode.cpp line 2098:

> 2096:     auto is_lower_double_word_and_mask_op = [](const Node *n) {
> 2097:       if (n->Opcode() == Op_AndV) {
> 2098:         Node *replicate_operand = n->in(1)->Opcode() == Op_Replicate ? 
> n->in(1)

Suggestion:

        Node* replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1)

src/hotspot/share/opto/vectornode.cpp line 2124:

> 2122:     // MulL ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
> 2123:     if ((is_lower_double_word_and_mask_op(in(1)) ||
> 2124:          is_lower_double_word_and_mask_op(in(1)) ||

`is_lower_double_word_and_mask_op(in(1)) || 
is_lower_double_word_and_mask_op(in(1))` is redundant, right? Shouldn't you 
only need it once? Same for the other 3 calls, which are similarly repeated.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 41:

> 39:  */
> 40: 
> 41: public class VectorMultiplyOpt {

Could it be possible to also do IR verification in this test? It would be good 
to check that we don't generate `AndVL` or `URShiftVL` with this transform.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 43:

> 41: public class VectorMultiplyOpt {
> 42: 
> 43:     public static long [] src1;

Suggestion:

    public static long[] src1;

And for the rest of the `long []` in this file too.

test/micro/org/openjdk/bench/jdk/incubator/vector/VectorXXH3HashingBenchmark.java
 line 39:

> 37:     @Param({"1024", "2048", "4096", "8192"})
> 38:     private int  SIZE;
> 39:     private long [] accumulators;

Suggestion:

    private long[] accumulators;

-------------

PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2367683334
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800159123
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153755
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153568
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153842
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800151177
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800167403
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800165261
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800169840

Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction [v2]

Reply via email to