Re: RFR: 8318650: Optimized subword gather for x86 targets.

Jatin Bhateja Thu, 26 Oct 2023 22:18:13 -0700

On Thu, 26 Oct 2023 07:21:28 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:


>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and 
>> AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather with high performance backend implementation 
>> based on hybrid algorithm which initially partially unrolls scalar loop to 
>> accumulates values from gather indices into a quadword(64bit) slice followed 
>> by vector permutation to place the slice into appropriate vector lanes, it 
>> prevents code bloating and generates compact
>> JIT sequence. This coupled with savings from expansive array allocation in 
>> existing java implementation translates into significant performance of 
>> 1.3-5x gains with included micro.
>> 
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by 
>> replacing temporary array allocation with zero initialized vector and a 
>> scalar loops which inserts gathered values into vector. But, vector insert 
>> operation in higher vector lanes is a three step process which first 
>> extracts the upper vector 128 bit lane, updates it with gather subword value 
>> and then inserts the lane back to its original position. This makes inserts 
>> into higher order lanes costly w.r.t to proposed solution. In addition 
>> generated JIT code for modified fallback implementation was very bulky. This 
>> may impact in-lining decisions into caller contexts.
>> 
>> 3) Some minor adjustments in existing gather instruction pattens for 
>> double/quad words.
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> 
>> Best Regards,
>> Jatin
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1486:
> 
>> 1484:     // Check whether the predicated gather/scatter node is supported 
>> by architecture.
>> 1485:     VectorMaskUseType mask = (is_scatter || !is_subword_type(elem_bt)) 
>> ? (VectorMaskUseType) (VecMaskUseLoad | VecMaskUsePred) : VecMaskUseLoad;
>> 1486:     if (!arch_supports_vector(is_scatter ? Op_StoreVectorScatterMasked 
>> : Op_LoadVectorGatherMasked, num_elem, elem_bt, mask)) {
> 
> What is the difference between subword-type load gather and others? It seems 
> only check `VecMaskUseLoad` is enough for all kinds of masked gather/scatter. 
> Only checking `is_match_rule_supported_vector` for these ops are ok. WDYT?

sub-word gather do not emit any predicated instructions, thus only 
VectorMaskUseLoad is relevant in this context, however AVX512 and SVE does have 
a direct predicated gather instructions for 32/64 bit types.

> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
> line 3062:
> 
>> 3060:         // Constant folding should sweep out following conditonal 
>> logic.
>> 3061:         if (isp.length() > IntVector.SPECIES_PREFERRED.length()) {
>> 3062:             lsp = IntVector.SPECIES_PREFERRED;
> 
> What happens if the needed index vector length is larger that the max vector 
> length of int type? 
> For example, byte vectors with `SPECIES_128`, may needs an index vector with 
> `IntVector.SPECIES_512` species? And on some architectures like NEON, or SVE 
> with 128-bit max vector size, the preferred species is only `SPECIES_128`.

For x86 backend implementation we emit a gather loop hence index species is not 
used for sub-word, I am performing the index out of bound in a loop which 
operate a max integer vector granularity supported by target.
https://github.com/openjdk/jdk/pull/16354/files#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR3641

Because there is cap on max integral vector size, thus for targets supporting 
direct sub-word gather instruction will have to emit an instruction sequence 
comprising of partial indexMap loads into integral vectors and issue multiple 
gather operations.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1374077991
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1374077927

Re: RFR: 8318650: Optimized subword gather for x86 targets.

Reply via email to