Re: RFR: 8318650: Optimized subword gather for x86 targets.

Xiaohong Gong Thu, 26 Oct 2023 00:47:58 -0700

On Wed, 25 Oct 2023 04:34:59 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


> Hi All,
> 
> This patch optimizes sub-word gather operation for x86 targets with AVX2 and 
> AVX512 features.
> 
> Following is the summary of changes:-
> 
> 1) Intrinsify sub-word gather with high performance backend implementation 
> based on hybrid algorithm which initially partially unrolls scalar loop to 
> accumulates values from gather indices into a quadword(64bit) slice followed 
> by vector permutation to place the slice into appropriate vector lanes, it 
> prevents code bloating and generates compact
> JIT sequence. This coupled with savings from expansive array allocation in 
> existing java implementation translates into significant performance of 
> 1.3-5x gains with included micro.
> 
> 
> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
> 
> 
> 2) Patch was also compared against modified java fallback implementation by 
> replacing temporary array allocation with zero initialized vector and a 
> scalar loops which inserts gathered values into vector. But, vector insert 
> operation in higher vector lanes is a three step process which first extracts 
> the upper vector 128 bit lane, updates it with gather subword value and then 
> inserts the lane back to its original position. This makes inserts into 
> higher order lanes costly w.r.t to proposed solution. In addition generated 
> JIT code for modified fallback implementation was very bulky. This may impact 
> in-lining decisions into caller contexts.
> 
> 3) Some minor adjustments in existing gather instruction pattens for 
> double/quad words.
> 
> 
> Kindly review and share your feedback.
> 
> 
> Best Regards,
> Jatin

src/hotspot/share/opto/vectorIntrinsics.cpp line 1486:

> 1484:     // Check whether the predicated gather/scatter node is supported by 
> architecture.
> 1485:     VectorMaskUseType mask = (is_scatter || !is_subword_type(elem_bt)) 
> ? (VectorMaskUseType) (VecMaskUseLoad | VecMaskUsePred) : VecMaskUseLoad;
> 1486:     if (!arch_supports_vector(is_scatter ? Op_StoreVectorScatterMasked 
> : Op_LoadVectorGatherMasked, num_elem, elem_bt, mask)) {

What is the difference between subword-type load gather and others? It seems 
only check `VecMaskUseLoad` is enough for all kinds of masked gather/scatter. 
Only checking `is_match_rule_supported_vector` for these ops are ok. WDYT?

src/hotspot/share/opto/vectorIntrinsics.cpp line 1599:

> 1597:       if (is_subword_type(elem_bt)) {
> 1598:         Node* index_arr_base = array_element_address(argument(12), 
> argument(13), T_INT);
> 1599:         vload = gvn().transform(new 
> LoadVectorGatherMaskedNode(control(), memory(addr), addr, addr_type, 
> vector_type, index_arr_base, mask, argument(11)));

It's better to define the meaningful variables for `argument(12), argument(13), 
argument(11)`.

src/hotspot/share/opto/vectornode.hpp line 1001:

> 999:     assert(req() == MemNode::ValueIn + 2, "match_edge expects that last 
> input is in MemNode::ValueIn+1");
> 1000:     if (is_subword_type(vect_type()->is_vect()->element_basic_type())) {
> 1001:       add_req(offset);

`vect_type()->is_vect()`  -> `vt` ?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java 
line 3062:

> 3060:         // Constant folding should sweep out following conditonal logic.
> 3061:         if (isp.length() > IntVector.SPECIES_PREFERRED.length()) {
> 3062:             lsp = IntVector.SPECIES_PREFERRED;

What happens if the needed index vector length is larger that the max vector 
length of int type? 
For example, byte vectors with `SPECIES_128`, may needs an index vector with 
`IntVector.SPECIES_512` species? And on some architectures like NEON, or SVE 
with 128-bit max vector size, the preferred species is only `SPECIES_128`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372703788
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372705631
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372710351
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1372720580

Re: RFR: 8318650: Optimized subword gather for x86 targets.

Reply via email to