On Thu, 26 Oct 2023 07:21:28 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
>> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather with high performance backend implementation >> based on hybrid algorithm which initially partially unrolls scalar loop to >> accumulates values from gather indices into a quadword(64bit) slice followed >> by vector permutation to place the slice into appropriate vector lanes, it >> prevents code bloating and generates compact >> JIT sequence. This coupled with savings from expansive array allocation in >> existing java implementation translates into significant performance of >> 1.3-5x gains with included micro. >> >> >>  >> >> >> 2) Patch was also compared against modified java fallback implementation by >> replacing temporary array allocation with zero initialized vector and a >> scalar loops which inserts gathered values into vector. But, vector insert >> operation in higher vector lanes is a three step process which first >> extracts the upper vector 128 bit lane, updates it with gather subword value >> and then inserts the lane back to its original position. This makes inserts >> into higher order lanes costly w.r.t to proposed solution. In addition >> generated JIT code for modified fallback implementation was very bulky. This >> may impact in-lining decisions into caller contexts. >> >> 3) Some minor adjustments in existing gather instruction pattens for >> double/quad words. >> >> >> Kindly review and share your feedback. >> >> >> Best Regards, >> Jatin > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1486: > >> 1484: // Check whether the predicated gather/scatter node is supported >> by architecture. >> 1485: VectorMaskUseType mask = (is_scatter || !is_subword_type(elem_bt)) >> ? (VectorMaskUseType) (VecMaskUseLoad | VecMaskUsePred) : VecMaskUseLoad; >> 1486: if (!arch_supports_vector(is_scatter ? Op_StoreVectorScatterMasked >> : Op_LoadVectorGatherMasked, num_elem, elem_bt, mask)) { > > What is the difference between subword-type load gather and others? It seems > only check `VecMaskUseLoad` is enough for all kinds of masked gather/scatter. > Only checking `is_match_rule_supported_vector` for these ops are ok. WDYT? sub-word gather do not emit any predicated instructions, thus only VectorMaskUseLoad is relevant in this context, however AVX512 and SVE does have a direct predicated gather instructions for 32/64 bit types. > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java > line 3062: > >> 3060: // Constant folding should sweep out following conditonal >> logic. >> 3061: if (isp.length() > IntVector.SPECIES_PREFERRED.length()) { >> 3062: lsp = IntVector.SPECIES_PREFERRED; > > What happens if the needed index vector length is larger that the max vector > length of int type? > For example, byte vectors with `SPECIES_128`, may needs an index vector with > `IntVector.SPECIES_512` species? And on some architectures like NEON, or SVE > with 128-bit max vector size, the preferred species is only `SPECIES_128`. For x86 backend implementation we emit a gather loop hence index species is not used for sub-word, I am performing the index out of bound in a loop which operate a max integer vector granularity supported by target. https://github.com/openjdk/jdk/pull/16354/files#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR3641 Because there is cap on max integral vector size, thus for targets supporting direct sub-word gather instruction will have to emit an instruction sequence comprising of partial indexMap loads into integral vectors and issue multiple gather operations. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1374077991 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1374077927