On Wed, 25 Oct 2023 04:34:59 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
> Hi All, > > This patch optimizes sub-word gather operation for x86 targets with AVX2 and > AVX512 features. > > Following is the summary of changes:- > > 1) Intrinsify sub-word gather with high performance backend implementation > based on hybrid algorithm which initially partially unrolls scalar loop to > accumulates values from gather indices into a quadword(64bit) slice followed > by vector permutation to place the slice into appropriate vector lanes, it > prevents code bloating and generates compact > JIT sequence. This coupled with savings from expansive array allocation in > existing java implementation translates into significant performance of > 1.3-5x gains with included micro. > > >  > > > 2) Patch was also compared against modified java fallback implementation by > replacing temporary array allocation with zero initialized vector and a > scalar loops which inserts gathered values into vector. But, vector insert > operation in higher vector lanes is a three step process which first extracts > the upper vector 128 bit lane, updates it with gather subword value and then > inserts the lane back to its original position. This makes inserts into > higher order lanes costly w.r.t to proposed solution. In addition generated > JIT code for modified fallback implementation was very bulky. This may impact > in-lining decisions into caller contexts. > > 3) Some minor adjustments in existing gather instruction pattens for > double/quad words. > > > Kindly review and share your feedback. > > > Best Regards, > Jatin src/hotspot/share/opto/vectorIntrinsics.cpp line 1486: > 1484: // Check whether the predicated gather/scatter node is supported by > architecture. > 1485: VectorMaskUseType mask = (is_scatter || !is_subword_type(elem_bt)) > ? (VectorMaskUseType) (VecMaskUseLoad | VecMaskUsePred) : VecMaskUseLoad; > 1486: if (!arch_supports_vector(is_scatter ? Op_StoreVectorScatterMasked > : Op_LoadVectorGatherMasked, num_elem, elem_bt, mask)) { What is the difference between subword-type load gather and others? It seems only check `VecMaskUseLoad` is enough for all kinds of masked gather/scatter. Only checking `is_match_rule_supported_vector` for these ops are ok. WDYT? src/hotspot/share/opto/vectorIntrinsics.cpp line 1599: > 1597: if (is_subword_type(elem_bt)) { > 1598: Node* index_arr_base = array_element_address(argument(12), > argument(13), T_INT); > 1599: vload = gvn().transform(new > LoadVectorGatherMaskedNode(control(), memory(addr), addr, addr_type, > vector_type, index_arr_base, mask, argument(11))); It's better to define the meaningful variables for `argument(12), argument(13), argument(11)`. src/hotspot/share/opto/vectornode.hpp line 1001: > 999: assert(req() == MemNode::ValueIn + 2, "match_edge expects that last > input is in MemNode::ValueIn+1"); > 1000: if (is_subword_type(vect_type()->is_vect()->element_basic_type())) { > 1001: add_req(offset); `vect_type()->is_vect()` -> `vt` ? src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3062: > 3060: // Constant folding should sweep out following conditonal logic. > 3061: if (isp.length() > IntVector.SPECIES_PREFERRED.length()) { > 3062: lsp = IntVector.SPECIES_PREFERRED; What happens if the needed index vector length is larger that the max vector length of int type? For example, byte vectors with `SPECIES_128`, may needs an index vector with `IntVector.SPECIES_512` species? And on some architectures like NEON, or SVE with 128-bit max vector size, the preferred species is only `SPECIES_128`. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372703788 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372705631 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372710351 PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372720580