On Wed, 9 Oct 2024 09:59:11 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> This patch optimizes LongVector multiplication by inferring the VPMULUDQ 
>> instruction for the following IR patterns.
>>   
>> 
>>        MulL (And     SRC1, 0xFFFFFFFF) (And     SRC2, 0xFFFFFFFF)
>>        MulL (URShift SRC1, 32)         (URShift SRC2, 32)
>>        MulL (URShift SRC1, 32)         (And     SRC2, 0xFFFFFFFF)
>>        MulL (And     SRC1, 0xFFFFFFFF) (URShift SRC2, 32)
>> 
>> 
>> 
>>  A 64x64-bit multiplication produces a 128-bit result. It can be performed 
>> by individually multiplying the upper and lower doublewords of the 
>> multiplier with the multiplicand and assembling the partial products into 
>> the full-width result. Targets supporting vector quadword multiplication 
>> have separate instructions to compute the upper and lower quadwords of the 
>> 128-bit result. Therefore, the existing VectorAPI multiplication operator 
>> expects shape conformance between source and result vectors.
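>> 
>> As an illustrative aside (a minimal scalar Java sketch, not taken from the 
>> patch; the helper name mulLow64 is hypothetical), the low 64 bits of such a 
>> product can be assembled from the 32-bit partial products, with the 
>> high-half term dropping out:
>> 
>>     // Assemble the low 64 bits of a 64x64 multiplication from partial
>>     // products of the 32-bit halves; aHi*bHi only affects bits 64..127.
>>     static long mulLow64(long a, long b) {
>>         long aLo = a & 0xFFFFFFFFL, aHi = a >>> 32;
>>         long bLo = b & 0xFFFFFFFFL, bHi = b >>> 32;
>>         return aLo * bLo + ((aHi * bLo + aLo * bHi) << 32);
>>     }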
>> 
>> If the upper 32 bits of the quadword multiplier and multiplicand are always 
>> zero, then the result of the multiplication depends only on the partial 
>> product of their lower doublewords and can be computed with an unsigned 
>> 32-bit multiplication instruction that produces a full quadword result. The 
>> patch matches this pattern in a target-dependent manner without introducing 
>> a new IR node.
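>> 
>> A minimal Vector API sketch of a source-level shape that produces the 
>> MulL (And ..., And ...) pattern above (illustrative only; it assumes the 
>> jdk.incubator.vector module is enabled and the method name is made up):
>> 
>>     import jdk.incubator.vector.LongVector;
>> 
>>     // Both operands have their upper 32 bits cleared, so the long
>>     // multiplication depends only on the low doublewords and is a
>>     // candidate for lowering to VPMULUDQ on x86.
>>     static LongVector maskedMul(LongVector a, LongVector b) {
>>         LongVector aLo = a.and(0xFFFFFFFFL);
>>         LongVector bLo = b.and(0xFFFFFFFFL);
>>         return aLo.mul(bLo);
>>     }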
>>  
>> The VPMULUDQ instruction performs unsigned multiplication between the 
>> even-numbered doubleword lanes of two long vectors and produces a 64-bit 
>> result per lane. It has much lower latency than the full 64-bit 
>> multiplication instruction VPMULLQ; in addition, non-AVX512DQ targets do 
>> not support direct quadword multiplication, so we avoid the redundant 
>> partial-product computations for the zeroed-out upper 32 bits. This results 
>> in throughput improvements on both P-core and E-core Xeons.
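>> 
>> A rough scalar model of the per-lane behaviour (illustrative only, not the 
>> HotSpot implementation; the method name is made up):
>> 
>>     // For each 64-bit lane, multiply the even-numbered (low) doublewords
>>     // as unsigned 32-bit values and store the full 64-bit product, which
>>     // is what VPMULUDQ computes.
>>     static void vpmuludqModel(long[] dst, long[] src1, long[] src2) {
>>         for (int i = 0; i < dst.length; i++) {
>>             long lo1 = src1[i] & 0xFFFFFFFFL;
>>             long lo2 = src2[i] & 0xFFFFFFFFL;
>>             dst[i] = lo1 * lo2;
>>         }
>>     }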
>> 
>> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html) 
>> included with the patch:
>>  
>> 
>> Sierra Forest :-
>> ============
>> Baseline:-
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
>> 
>> With Optimization:-
>> Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel   ...
>
> Jatin Bhateja has updated the pull request with a new target base due to a 
> merge or a rebase. The pull request now contains two commits:
> 
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

For the record, I think in this PR we could simply match the IR patterns in the 
ad file, since (from my understanding) the patterns we are matching could be 
supported there. We should do platform-specific lowering in a separate patch 
because it is pretty nuanced, and we could potentially move it to the new 
system afterwards.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2411538179
