Re: RFR: 8318650: Optimized subword gather for x86 targets. [v16]

Emanuel Peter Tue, 27 Feb 2024 02:49:12 -0800

On Tue, 27 Feb 2024 02:47:13 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and 
>> AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially 
>> partially unrolls scalar loop to accumulates values from gather indices into 
>> a quadword(64bit) slice followed by vector permutation to place the slice 
>> into appropriate vector lanes, it prevents code bloating and generates 
>> compact JIT sequence. This coupled with savings from expansive array 
>> allocation in existing java implementation translates into significant 
>> performance of 1.5-10x gains with included micro.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by 
>> replacing temporary array allocation with zero initialized vector and a 
>> scalar loops which inserts gathered values into vector. But, vector insert 
>> operation in higher vector lanes is a three step process which first 
>> extracts the upper vector 128 bit lane, updates it with gather subword value 
>> and then inserts the lane back to its original position. This makes inserts 
>> into higher order lanes costly w.r.t to proposed solution. In addition 
>> generated JIT code for modified fallback implementation was very bulky. This 
>> may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Review comment resolutions.

I sent the patch for testing.
And added some suggestions for comments in `C2_MacroAssembler::vgather_subword`.
I hope I have not been annoying you too much 🙈

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1672:

> 1670:   Label GATHER8_LOOP;
> 1671:   XMMRegister iota = xtmp1;
> 1672:   XMMRegister two_vec = xtmp2;

I'm sorry to bother you so much with this. I think adding aliases that don't 
mention what register they alias to makes things worse. Now I can't even see if 
two names might alias to the same register.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1672:

> 1670:   Label GATHER8_LOOP;
> 1671:   XMMRegister iota = xtmp1;
> 1672:   XMMRegister two_vec = xtmp2;

Suggestion:

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1681:

> 1679:   vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc);
> 1680:   vpslld(two_vec, xtmp2, 1, vlen_enc);
> 1681:   load_iota_indices(iota, vector_len * type2aelembytes(elem_ty), T_INT);

Suggestion:

  vpxor(xtmp1, xtmp1, xtmp1, vlen_enc); // xtmp1 = {0, ...}
  vpxor(dst, dst, dst, vlen_enc); // dst = {0, ...}
  vallones(xtmp2, vlen_enc);
  vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc);
  vpslld(xtmp2, xtmp2, 1, vlen_enc); // xtmp2 = {2, 2, ...}
  load_iota_indices(xtmp1, vector_len * type2aelembytes(elem_ty), T_INT); // 
xtmp1 = {0, 1, 2, ...}

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1688:

> 1686:     } else {
> 1687:       LP64_ONLY(vgather8b_masked_offset(elem_ty, temp_dst, base, 
> idx_base, offset, mask, mask_idx, rtmp, vlen_enc));
> 1688:     }

Suggestion:

    // TMP_VEC_64(temp_dst) = PICK_SUB_WORDS_FROM_GATHER_INDICES
    if (mask == noreg) {
      vgather8b_offset(elem_ty, temp_dst, base, idx_base, offset, rtmp, 
vlen_enc);
    } else {
      LP64_ONLY(vgather8b_masked_offset(elem_ty, temp_dst, base, idx_base, 
offset, mask, mask_idx, rtmp, vlen_enc));
    }

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1694:

> 1692:     addptr(idx_base,  32 >> (type2aelembytes(elem_ty) - 1));
> 1693:     subl(length, 8 >> (type2aelembytes(elem_ty) - 1));
> 1694:     jcc(Assembler::notEqual, GATHER8_LOOP);

Suggestion:

    // TEMP_PERM_VEC(temp_dst) = PERMUTE TMP_VEC_64(temp_dst) PERM_INDEX(xtmp1)
    vpermd(temp_dst, xtmp1, temp_dst, vlen_enc == Assembler::AVX_512bit ? 
vlen_enc : Assembler::AVX_256bit);

    // PERM_INDEX(xtmp1) = PERM_INDEX(xtmp1) - TWO_VEC(xtmp2)
    vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc);

    // DST_VEC = DST_VEC OR TEMP_PERM_VEC
    vpor(dst, dst, temp_dst, vlen_enc);

    addptr(idx_base,  32 >> (type2aelembytes(elem_ty) - 1));
    subl(length, 8 >> (type2aelembytes(elem_ty) - 1));
    jcc(Assembler::notEqual, GATHER8_LOOP);

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1903117002
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1503996753
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504001356
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504007501
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504019352
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1504017498

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v16]

Reply via email to