Re: RFR: 8318650: Optimized subword gather for x86 targets. [v5]

Jatin Bhateja Thu, 09 Nov 2023 20:58:36 -0800

On Fri, 10 Nov 2023 03:33:51 GMT, Sandhya Viswanathan 
<sviswanat...@openjdk.org> wrote:


>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1648:
>> 
>>> 1646:     vpermd(xtmp3, xtmp1, xtmp3, vlen_enc == Assembler::AVX_512bit ? vlen_enc : Assembler::AVX_256bit);
>>> 1647:     vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc);
>>> 1648:     vpor(dst, dst, xtmp3, vlen_enc);
>> 
>> xtmp1 starts out as 0, 1, ...
>> so vpermd will place the lower 64 bits of xtmp3 in the lower 64 bits of dst.
>> Why vpsubd and not vpaddd? It looks to me that vpaddd is more intuitive to 
>> understand.
>> With vpaddd, xtmp1 will become 2, 3 in the next iteration, 
>> so vpermd will place the lower 64 bits of xtmp3 in bits 127:64 of dst, 
>> and so on.
>> 
>> Another point, for avx512 it looks to me that vpermd and vpor could be 
>> merged into one single instruction vpermd having dst as destination and 
>> merge bit set to true.
>
> Please ignore the last bit about avx512 vpermd merge as we are not using mask 
> registers here.

> xtmp1 starts out as 0, 1, ... so vpermd will place the lower 64 bits of xtmp3 
> in the lower 64 bits of dst. Why vpsubd and not vpaddd? It looks to me that 
> vpaddd is more intuitive to understand. With vpaddd, xtmp1 will become 2, 3 in 
> the next iteration, so vpermd will place the lower 64 bits of xtmp3 in bits 
> 127:64 of dst, and so on.
> 
I have taken a different approach here, based on progressive subtraction, to 
get the permute indices for each iteration.
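
To illustrate, here is a rough scalar model of the 256-bit case. It is only a 
sketch, not the HotSpot code: permute_dwords and accumulate_chunks are 
hypothetical names, and it assumes that the temporary vector holds data only in 
its lower two dwords and that the vector subtracted from the indices holds 2 in 
every lane. The one hardware fact it relies on is that vpermd fills each 
destination lane i with src[idx[i] mod 8], using only the low 3 bits of each 
index for a 256-bit vector.

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes = 8;  // dword lanes in a 256-bit vector
using Vec = std::array<int32_t, kLanes>;

// Scalar model of VPERMD: destination lane i receives src[idx[i] mod 8].
Vec permute_dwords(const Vec& idx, const Vec& src) {
  Vec out{};
  for (int i = 0; i < kLanes; ++i) {
    out[i] = src[static_cast<uint32_t>(idx[i]) & (kLanes - 1)];
  }
  return out;
}

// Hypothetical helper: ORs four 64-bit chunks (two dwords each) into one
// vector, stepping the permute indices by `step` after every iteration.
// step = -2 models the vpsubd variant, step = +2 the vpaddd variant.
Vec accumulate_chunks(const std::array<std::array<int32_t, 2>, 4>& chunks,
                      int32_t step) {
  Vec dst{};                         // dst starts out as all zeros
  Vec idx{0, 1, 2, 3, 4, 5, 6, 7};   // iota permute indices
  for (const auto& chunk : chunks) {
    Vec tmp{};                       // upper six lanes stay zero (assumption)
    tmp[0] = chunk[0];
    tmp[1] = chunk[1];
    const Vec permuted = permute_dwords(idx, tmp);
    for (int i = 0; i < kLanes; ++i) {
      dst[i] |= permuted[i];         // models vpor(dst, dst, tmp)
      idx[i] += step;                // models vpsubd / vpaddd on the indices
    }
  }
  return dst;
}
```

In this model, with chunks {1,2}, {3,4}, {5,6}, {7,8}, a step of -2 gives 
{1,2,3,4,5,6,7,8}: after one subtraction the indices are -2, -1, 0, 1, ..., so 
lanes 2 and 3 are the ones that select source dwords 0 and 1. A step of +2 
gives {1,2,7,8,5,6,3,4} instead, because the indices wrap modulo 8 and source 
dwords 0 and 1 are then selected by lanes 6 and 7.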

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1388901259
