On Tue, 16 Jan 2024 06:08:31 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757:
>>
>>> 1755:     for (int i = 0; i < 4; i++) {
>>> 1756:       movl(rtmp, Address(idx_base, i * 4));
>>> 1757:       pinsrw(dst, Address(base, rtmp, Address::times_2), i);
>>
>> Do I understand this right that you are basically doing this?
>> `dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)`
>> But this does not look like a gather, rather like 4 adjacent loads that pack
>> the data together into a single 8*4 byte vector.
>>
>> Why can this not be done by a simple `32bit` load?
>
> Loop scans over integral index array and pick the work from computed address,
> indexes could be non-contiguous.

Maybe you could add comment lines that state this, similar to the documentation?
`dst[i] = load(base + 2 * load(idx_base + i * 4))`
Or maybe:
`dst[i] = base[idx_base[i * 4] * 2]`

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1453013821
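
For illustration only, here is a scalar sketch of what the quoted loop appears to compute, written as plain C++ rather than HotSpot assembler code. The function name `gather4_shorts` and the byte-pointer parameters are hypothetical and not part of the PR. It assumes 32-bit indices, 16-bit elements, and non-negative index values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical scalar model of the 4-iteration loop (not HotSpot code).
// base and idx_base are byte addresses, as in the assembler operands.
std::array<int16_t, 4> gather4_shorts(const uint8_t* base, const uint8_t* idx_base) {
    std::array<int16_t, 4> dst{};
    for (int i = 0; i < 4; i++) {
        // Models movl(rtmp, Address(idx_base, i * 4)): load the i-th 32-bit index.
        int32_t idx;
        std::memcpy(&idx, idx_base + i * 4, sizeof(idx));
        // Models pinsrw(dst, Address(base, rtmp, Address::times_2), i):
        // load one 16-bit element at base + idx * 2 into lane i.
        std::memcpy(&dst[i], base + static_cast<std::ptrdiff_t>(idx) * 2, sizeof(int16_t));
    }
    return dst;
}
```

Because each lane's address depends on a value loaded from the index array, the four loads need not be adjacent, which is why a single 32-bit or 64-bit load would not be equivalent in general.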