On Sun, 29 Sep 2024 04:21:19 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

> This patch optimizes LongVector multiplication by inferring the VPMUL[U]DQ 
> instruction for the following IR patterns.
>   
> 
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF)   ( AndV  SRC2,  0xFFFFFFFF) 
>        MulVL   (URShiftVL SRC1 , 32) (URShiftVL SRC2, 32)
>        MulVL   (URShiftVL SRC1 , 32)  ( AndV  SRC2,  0xFFFFFFFF)
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF) (URShiftVL SRC2 , 32)
>        MulVL   (VectorCastI2X SRC1) (VectorCastI2X SRC2)
>        MulVL   (RShiftVL SRC1 , 32) (RShiftVL SRC2, 32)
> 
> 
> 
> A 64x64 bit multiplication produces a 128 bit result and can be performed by 
> individually multiplying the upper and lower double words of the multiplier 
> with the multiplicand and assembling the partial products into the full 
> width result. Targets supporting vector quadword multiplication have 
> separate instructions to compute the upper and lower quadwords of the 128 
> bit result. Therefore the existing VectorAPI multiplication operator expects 
> shape conformance between source and result vectors.
> 
> If the upper 32 bits of the quadword multiplier and multiplicand are always 
> zero, then the result of the multiplication depends only on the partial 
> product of their lower double words and can be computed using an unsigned 
> 32 bit multiplication instruction with quadword saturation. The patch 
> matches this pattern in a target dependent manner without introducing a new 
> IR node.
>  
> The VPMUL[U]DQ instruction performs [unsigned] multiplication between the 
> even numbered doubleword lanes of two long vectors and produces a 64 bit 
> result. It has a much lower latency than the full 64 bit multiplication 
> instruction "VPMULLQ"; in addition, non-AVX512DQ targets do not support 
> direct quadword multiplication, so we save the redundant partial products 
> for the zeroed out upper 32 bits. This results in throughput improvements on 
> both P-core and E-core Xeons.
> 
> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html) 
> included with the patch:
> 
> Sierra Forest :-
> ============
> Baseline:-
> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
> 
> With Optimization:-
> Benchmark                                 (SIZE)   Mode  ...
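
Just to restate the key identity for other readers (my own scalar sketch, not 
code from the patch): once both operands have their upper double words 
cleared, the full 64 bit product is exactly the unsigned 32x32 -> 64 product 
of their low double words, which is what VPMULUDQ computes per lane:

    #include <cassert>
    #include <cstdint>

    int main() {
      uint64_t a = 0x00000000CAFEBABEULL;  // upper double word already zero
      uint64_t b = 0x00000000DEADBEEFULL;  // upper double word already zero
      uint64_t full = a * b;                                      // what MulVL computes per lane
      uint64_t low  = (a & 0xFFFFFFFFULL) * (b & 0xFFFFFFFFULL);  // what VPMULUDQ computes per lane
      assert(full == low);  // holds whenever both upper double words are zero
      return 0;
    }

So the transform is really just about recognizing, via the patterns above, 
that the upper double words are already known to be zero.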

> Hi @jaskarth, larger pattern matching is sensitive to [IR level node 
> sharing](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L1724), 
> so it may not be foolproof for the above 4 patterns. The current patch takes 
> care of this limitation.

I think this is a good point. I've taken a look at the patch and added some 
comments below.

Hmm, do you think this pattern could be matched in the ad-files instead of the 
middle end? I think that might be a lot cleaner since the backend already has 
systems for matching node trees, which could avoid a lot of the complexity 
here. I think it could make the patch a lot smaller and simpler.

For the record, I think in this PR we could simply match the IR patterns in the 
ad file, since (from my understanding) the patterns we are matching could be 
supported there. We should do platform-specific lowering in a separate patch 
because it is pretty nuanced, and we could potentially move it to the new 
system afterwards.

I'm pretty ambivalent; I think implementing it either way would be alright. 
Especially with unit tests, I think the lowering implementation wouldn't be 
that difficult. Maybe another reviewer has an opinion?

About PhaseLowering though, I've found some more interesting things we could do 
with it, especially with improving vectorization support in the backend. 
@merykitty have you already started to work on it? I was thinking about 
prototyping it soon. Just wanted to make sure we're not doing the same work 
twice :)

BTW, since our last conversation I have started working on PhaseLowering 
myself; you can see my work so far on my branch: 
https://github.com/jaskarth/jdk/tree/phase-lowering. I think I can publish an 
RFE in the coming two or three days (there were some optimizations and cleanup 
I was prototyping, which I will remove before sending a PR). Do you think we 
should continue with my branch, or do you want to approach the problem in a 
different way? Just want to check again to make sure we don't end up re-doing 
the same work :)

src/hotspot/cpu/x86/matcher_x86.hpp line 184:

> 182:   // Does the CPU supports doubleword multiplication with quadword saturation.
> 183:   static constexpr bool supports_double_word_mult_with_quadword_staturation(void) {
> 184:     return true;

Should this be `UseAVX > 0`? I'm wondering since we have a `MulVL` rule that 
applies when `UseAVX == 0`.

src/hotspot/share/opto/vectornode.cpp line 2089:

> 2087:   if (Matcher::supports_double_word_mult_with_quadword_staturation() &&
> 2088:       !is_mult_lower_double_word()) {
> 2089:     auto is_clear_upper_double_word_uright_shift_op = [](const Node *n) {

Suggestion:

    auto is_clear_upper_double_word_uright_shift_op = [](const Node* n) {

src/hotspot/share/opto/vectornode.cpp line 2093:

> 2091:              n->in(2)->Opcode() == Op_RShiftCntV && n->in(2)->in(1)->is_Con() &&
> 2092:              n->in(2)->in(1)->bottom_type()->isa_int() &&
> 2093:              n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32L;

Suggestion:

             n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32;

Since you are comparing with a `TypeInt` I think this shouldn't be `32L`.

src/hotspot/share/opto/vectornode.cpp line 2098:

> 2096:     auto is_lower_double_word_and_mask_op = [](const Node *n) {
> 2097:       if (n->Opcode() == Op_AndV) {
> 2098:         Node *replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1)

Suggestion:

        Node* replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1)

src/hotspot/share/opto/vectornode.cpp line 2124:

> 2122:     // MulL ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)
> 2123:     if ((is_lower_double_word_and_mask_op(in(1)) ||
> 2124:          is_lower_double_word_and_mask_op(in(1)) ||

`is_lower_double_word_and_mask_op(in(1)) || 
is_lower_double_word_and_mask_op(in(1))` is redundant, right? Shouldn't you 
only need it once? Same for the other 3 calls, which are similarly repeated.
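
Presumably the intent was to accept either the mask form or the shift form 
independently on each input, covering the four commuted combinations listed in 
the comment; something along these lines (just my guess at the intended 
condition, please adjust):

    // Sketch only: mask-or-shift accepted independently on each input.
    if ((is_lower_double_word_and_mask_op(in(1)) ||
         is_clear_upper_double_word_uright_shift_op(in(1))) &&
        (is_lower_double_word_and_mask_op(in(2)) ||
         is_clear_upper_double_word_uright_shift_op(in(2)))) {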

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 41:

> 39:  */
> 40: 
> 41: public class VectorMultiplyOpt {

Would it be possible to also do IR verification in this test? It would be good 
to check that we don't generate `AndVL` or `URShiftVL` with this transform.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 43:

> 41: public class VectorMultiplyOpt {
> 42: 
> 43:     public static long [] src1;

Suggestion:

    public static long[] src1;

And for the rest of the `long []` in this file too.

test/micro/org/openjdk/bench/jdk/incubator/vector/VectorXXH3HashingBenchmark.java line 39:

> 37:     @Param({"1024", "2048", "4096", "8192"})
> 38:     private int  SIZE;
> 39:     private long [] accumulators;

Suggestion:

    private long[] accumulators;

-------------

PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2367683334
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2407658405
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2411538179
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2414553899
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2422700344
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800159123
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153755
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153568
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153842
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800151177
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800167403
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800165261
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800169840
