On Fri, 10 Nov 2023 03:33:51 GMT, Sandhya Viswanathan <sviswanat...@openjdk.org> wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1648: >> >>> 1646: vpermd(xtmp3, xtmp1, xtmp3, vlen_enc == Assembler::AVX_512bit ? >>> vlen_enc : Assembler::AVX_256bit); >>> 1647: vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc); >>> 1648: vpor(dst, dst, xtmp3, vlen_enc); >> >> xtmp1 starts out as 0, 1,... >> so vpermd will place the lower 64 bit from xtmp3 to lower 64 bit of dst >> why vpsubd and not vpaddd? It looks to me that vpaddd is more intutive to >> understand. >> if vpadd, xtmp1 will become 2,3 in next iteration >> so vpermd will place the lower 64 bit from xtmp3 to 127:64 of dst >> and so on so forth >> >> Another point, for avx512 it looks to me that vpermd and vpor could be >> merged into one single instruction vpermd having dst as destination and >> merge bit set to true. > > Please ignore the last bit about avx512 vpermd merge as we are not using mask > registers here. > xtmp1 starts out as 0, 1,... so vpermd will place the lower 64 bit from xtmp3 > to lower 64 bit of dst why vpsubd and not vpaddd? It looks to me that vpaddd > is more intutive to understand. if vpadd, xtmp1 will become 2,3 in next > iteration so vpermd will place the lower 64 bit from xtmp3 to 127:64 of dst > and so on so forth > I have taken a different approach here based on progressive subtraction to get permute indices for each iteration. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1388901259