On Wed, 9 Oct 2024 09:59:11 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> This patch optimizes LongVector multiplication by inferring VPMULUDQ >> instruction for following IR pallets. >> >> >> MulL ( And SRC1, 0xFFFFFFFF) ( And SRC2, 0xFFFFFFFF) >> MulL (URShift SRC1 , 32) (URShift SRC2, 32) >> MulL (URShift SRC1 , 32) ( And SRC2, 0xFFFFFFFF) >> MulL ( And SRC1, 0xFFFFFFFF) (URShift SRC2 , 32) >> >> >> >> A 64x64 bit multiplication produces 128 bit result, and can be performed >> by individually multiplying upper and lower double word of multiplier with >> multiplicand and assembling the partial products to compute full width >> result. Targets supporting vector quadword multiplication have separate >> instructions to compute upper and lower quadwords for 128 bit result. >> Therefore existing VectorAPI multiplication operator expects shape >> conformance between source and result vectors. >> >> If upper 32 bits of quadword multiplier and multiplicand is always set to >> zero then result of multiplication is only dependent on the partial product >> of their lower double words and can be performed using unsigned 32 bit >> multiplication instruction with quadword saturation. Patch matches this >> pattern in a target dependent manner without introducing new IR node. >> >> VPMULUDQ instruction performs unsigned multiplication between even numbered >> doubleword lanes of two long vectors and produces 64 bit result. It has >> much lower latency compared to full 64 bit multiplication instruction >> "VPMULLQ", in addition non-AVX512DQ targets does not support direct quadword >> multiplication, thus we can save redundant partial product for zeroed out >> upper 32 bits. This results into throughput improvements on both P and E >> core Xeons. >> >> Please find below the performance of [XXH3 hashing benchmark >> ](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)included >> with the patch:- >> >> >> Sierra Forest :- >> ============ >> Baseline:- >> Benchmark (SIZE) Mode Cnt Score >> Error Units >> VectorXXH3HashingBenchmark.hashingKernel 1024 thrpt 2 806.228 >> ops/ms >> VectorXXH3HashingBenchmark.hashingKernel 2048 thrpt 2 403.044 >> ops/ms >> VectorXXH3HashingBenchmark.hashingKernel 4096 thrpt 2 200.641 >> ops/ms >> VectorXXH3HashingBenchmark.hashingKernel 8192 thrpt 2 100.664 >> ops/ms >> >> With Optimization:- >> Benchmark (SIZE) Mode Cnt Score >> Error Units >> VectorXXH3HashingBenchmark.hashingKernel ... > > Jatin Bhateja has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains two commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137 > - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction > Hi @jaskarth , Bigger pattern matching is sensitive to [IR level node > sharing](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/matcher.cpp#L1724), > thus it may not full proof for above 4 patterns. Current patch takes care of > this limitation. I think this is a good point. I've taken a look at the patch and added some comments below. src/hotspot/cpu/x86/matcher_x86.hpp line 184: > 182: // Does the CPU supports doubleword multiplication with quadword > saturation. > 183: static constexpr bool > supports_double_word_mult_with_quadword_staturation(void) { > 184: return true; Should this be `UseAVX > 0`? I'm wondering since we have a `MulVL` rule that applies when `UseAVX == 0`. src/hotspot/share/opto/vectornode.cpp line 2089: > 2087: if (Matcher::supports_double_word_mult_with_quadword_staturation() && > 2088: !is_mult_lower_double_word()) { > 2089: auto is_clear_upper_double_word_uright_shift_op = [](const Node *n) > { Suggestion: auto is_clear_upper_double_word_uright_shift_op = [](const Node* n) { src/hotspot/share/opto/vectornode.cpp line 2093: > 2091: n->in(2)->Opcode() == Op_RShiftCntV && > n->in(2)->in(1)->is_Con() && > 2092: n->in(2)->in(1)->bottom_type()->isa_int() && > 2093: n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32L; Suggestion: n->in(2)->in(1)->bottom_type()->is_int()->get_con() == 32; Since you are comparing with a `TypeInt` I think this shouldn't be `32L`. src/hotspot/share/opto/vectornode.cpp line 2098: > 2096: auto is_lower_double_word_and_mask_op = [](const Node *n) { > 2097: if (n->Opcode() == Op_AndV) { > 2098: Node *replicate_operand = n->in(1)->Opcode() == Op_Replicate ? > n->in(1) Suggestion: Node* replicate_operand = n->in(1)->Opcode() == Op_Replicate ? n->in(1) src/hotspot/share/opto/vectornode.cpp line 2124: > 2122: // MulL ( And SRC1, 0xFFFFFFFF) (URShift SRC2 , 32) > 2123: if ((is_lower_double_word_and_mask_op(in(1)) || > 2124: is_lower_double_word_and_mask_op(in(1)) || `is_lower_double_word_and_mask_op(in(1)) || is_lower_double_word_and_mask_op(in(1))` is redundant, right? Shouldn't you only need it once? Same for the other 3 calls, which are similarly repeated. test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 41: > 39: */ > 40: > 41: public class VectorMultiplyOpt { Could it be possible to also do IR verification in this test? It would be good to check that we don't generate `AndVL` or `URShiftVL` with this transform. test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 43: > 41: public class VectorMultiplyOpt { > 42: > 43: public static long [] src1; Suggestion: public static long[] src1; And for the rest of the `long []` in this file too. test/micro/org/openjdk/bench/jdk/incubator/vector/VectorXXH3HashingBenchmark.java line 39: > 37: @Param({"1024", "2048", "4096", "8192"}) > 38: private int SIZE; > 39: private long [] accumulators; Suggestion: private long[] accumulators; ------------- PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2367683334 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800159123 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153755 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153568 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800153842 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800151177 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800167403 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800165261 PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1800169840