On Tue, 27 Feb 2024 02:47:13 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >> partially unrolls scalar loop to accumulates values from gather indices into >> a quadword(64bit) slice followed by vector permutation to place the slice >> into appropriate vector lanes, it prevents code bloating and generates >> compact JIT sequence. This coupled with savings from expansive array >> allocation in existing java implementation translates into significant >> performance of 1.5-10x gains with included micro. >> >>  >> >> >> 2) Patch was also compared against modified java fallback implementation by >> replacing temporary array allocation with zero initialized vector and a >> scalar loops which inserts gathered values into vector. But, vector insert >> operation in higher vector lanes is a three step process which first >> extracts the upper vector 128 bit lane, updates it with gather subword value >> and then inserts the lane back to its original position. This makes inserts >> into higher order lanes costly w.r.t to proposed solution. In addition >> generated JIT code for modified fallback implementation was very bulky. This >> may impact in-lining decisions into caller contexts. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > Review comment resolutions. I sent the patch for testing. And added some suggestions for comments in `C2_MacroAssembler::vgather_subword`. I hope I have not been annoying you too much 🙈 src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1672: > 1670: Label GATHER8_LOOP; > 1671: XMMRegister iota = xtmp1; > 1672: XMMRegister two_vec = xtmp2; I'm sorry to bother you so much with this. I think adding aliases that don't mention what register they alias to makes things worse. Now I can't even see if two names might alias to the same register. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1672: > 1670: Label GATHER8_LOOP; > 1671: XMMRegister iota = xtmp1; > 1672: XMMRegister two_vec = xtmp2; Suggestion: src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1681: > 1679: vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc); > 1680: vpslld(two_vec, xtmp2, 1, vlen_enc); > 1681: load_iota_indices(iota, vector_len * type2aelembytes(elem_ty), T_INT); Suggestion: vpxor(xtmp1, xtmp1, xtmp1, vlen_enc); // xtmp1 = {0, ...} vpxor(dst, dst, dst, vlen_enc); // dst = {0, ...} vallones(xtmp2, vlen_enc); vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc); vpslld(xtmp2, xtmp2, 1, vlen_enc); // xtmp2 = {2, 2, ...} load_iota_indices(xtmp1, vector_len * type2aelembytes(elem_ty), T_INT); // xtmp1 = {0, 1, 2, ...} src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1688: > 1686: } else { > 1687: LP64_ONLY(vgather8b_masked_offset(elem_ty, temp_dst, base, > idx_base, offset, mask, mask_idx, rtmp, vlen_enc)); > 1688: } Suggestion: // TMP_VEC_64(temp_dst) = PICK_SUB_WORDS_FROM_GATHER_INDICES if (mask == noreg) { vgather8b_offset(elem_ty, temp_dst, base, idx_base, offset, rtmp, vlen_enc); } else { LP64_ONLY(vgather8b_masked_offset(elem_ty, temp_dst, base, idx_base, offset, mask, mask_idx, rtmp, vlen_enc)); } src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1694: > 1692: addptr(idx_base, 32 >> (type2aelembytes(elem_ty) - 1)); > 1693: subl(length, 8 >> (type2aelembytes(elem_ty) - 1)); > 1694: jcc(Assembler::notEqual, GATHER8_LOOP); Suggestion: // TEMP_PERM_VEC(temp_dst) = PERMUTE TMP_VEC_64(temp_dst) PERM_INDEX(xtmp1) vpermd(temp_dst, xtmp1, temp_dst, vlen_enc == Assembler::AVX_512bit ? vlen_enc : Assembler::AVX_256bit); // PERM_INDEX(xtmp1) = PERM_INDEX(xtmp1) - TWO_VEC(xtmp2) vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc); // DST_VEC = DST_VEC OR TEMP_PERM_VEC vpor(dst, dst, temp_dst, vlen_enc); addptr(idx_base, 32 >> (type2aelembytes(elem_ty) - 1)); subl(length, 8 >> (type2aelembytes(elem_ty) - 1)); jcc(Assembler::notEqual, GATHER8_LOOP); ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1903117002 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1503996753 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504001356 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504007501 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504019352 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504017498