On Mon, 1 Jan 2024 14:36:06 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >> partially unrolls scalar loop to accumulates values from gather indices into >> a quadword(64bit) slice followed by vector permutation to place the slice >> into appropriate vector lanes, it prevents code bloating and generates >> compact JIT sequence. This coupled with savings from expansive array >> allocation in existing java implementation translates into significant >> performance of 1.5-10x gains with included micro on Intel Atom family CPUs >> and with JVM option UseAVX=2. >> >>  >> >> >> 2) For AVX512 targets algorithm uses integral gather instructions to load >> values from normalized indices which are multiple of integer size, followed >> by shuffling and packing exact sub-word values from integral lanes. >> >> 3) Patch was also compared against modified java fallback implementation by >> replacing temporary array allocation with zero initialized vector and a >> scalar loops which inserts gathered values into vector. But, vector insert >> operation in higher vector lanes is a three step process which first >> extracts the upper vector 128 bit lane, updates it with gather subword value >> and then inserts the lane back to its original position. This makes inserts >> into higher order lanes costly w.r.t to proposed solution. In addition >> generated JIT code for modified fallback implementation was very bulky. This >> may impact in-lining decisions into caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a > merge or a rebase. The pull request now contains 12 commits: > > - Accelerating masked sub-word gathers for AVX2 targets, this gives > additional 1.5-4x speedups over existing implementation. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 > - Removing JDK-8321648 related changes. > - Refined AVX3 implementation with integral gather. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8318650 > - Fix incorrect comment > - Review comments resolutions. > - Review comments resolutions. > - Review comments resolutions. > - Restricting masked sub-word gather to AVX512 target to align with integral > gather support. > - ... and 2 more: https://git.openjdk.org/jdk/compare/518ec971...de47076e Just had a quick look at this. Is there any support for gather with different indices for each element in the vector? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1627: > 1625: vpsrlvd(dst, dst, xtmp, vlen_enc); > 1626: // Pack double word vector into byte vector. > 1627: vpackI2X(T_BYTE, dst, ones, xtmp, vlen_enc); I would prefer if there was less code duplication here. I think there are just a few values which you could set to variables, and then apply for both versions. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1634: > 1632: Register offset, > XMMRegister offset_vec, XMMRegister idx_vec, > 1633: XMMRegister xtmp1, > XMMRegister xtmp2, XMMRegister xtmp3, KRegister mask, > 1634: KRegister gmask, > int vlen_enc, int vlen) { Would you mind giving a quick summary of what the input registers are and what exactly this method does? Why do we need to call `vgather_subword_avx3` so many times (`lane_count_subwords`)? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757: > 1755: for (int i = 0; i < 4; i++) { > 1756: movl(rtmp, Address(idx_base, i * 4)); > 1757: pinsrw(dst, Address(base, rtmp, Address::times_2), i); Do I understand this right that you are basically doing this? `dst[i*4 .. i*4 + 3] = load_8bytes(base + (idx_base + i * 4) * 2)` But this does not look like a gather, rather like 4 adjacent loads that pack the data together into a single 8*4 byte vector. If so, maybe you should either leave a comment, or even rename the method. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1776: > 1774: for (int i = 0; i < 4; i++) { > 1775: movl(rtmp, Address(idx_base, i * 4)); > 1776: addl(rtmp, offset); Can the `offset` not be added to `idx_base` before the loop? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1900: > 1898: vgather8b(elem_ty, xtmp3, base, idx_base, rtmp, vlen_enc); > 1899: } else { > 1900: LP64_ONLY(vgather8b_masked(elem_ty, xtmp3, base, idx_base, mask, > midx, rtmp, vlen_enc)); What happens if if not `LP64_ONLY`? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1821723578 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452399791 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452425355 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452440206 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452441071 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1452443784