Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > On Tue, 25 Jul 2023 at 18:25, Richard Sandiford > <richard.sandif...@arm.com> wrote: >> >> Hi, >> >> Thanks for the rework and sorry for the slow review. > Hi Richard, > Thanks for the suggestions! Please find my responses inline below. >> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: >> > Hi Richard, >> > This is reworking of patch to extend fold_vec_perm to handle VLA vectors. >> > The attached patch unifies handling of VLS and VLA vector_csts, while >> > using fallback code >> > for ctors. >> > >> > For VLS vector, the patch ignores underlying encoding, and >> > uses npatterns = nelts, and nelts_per_pattern = 1. >> > >> > For VLA patterns, if sel has a stepped sequence, then it >> > only chooses elements from a particular pattern of a particular >> > input vector. >> > >> > To make things simpler, the patch imposes following constraints: >> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2. >> > (b) The step size for a stepped sequence is a power of 2, and >> > multiple of npatterns of chosen input vector. >> > (c) Runtime vector length of sel is a multiple of sel_npatterns. >> > So, we don't handle sel.length = 2 + 2x and npatterns = 4. >> > >> > Eg: >> > op0, op1: npatterns = 2, nelts_per_pattern = 3 >> > op0_len = op1_len = 16 + 16x. >> > sel = { 0, 0, 2, 0, 4, 0, ... } >> > npatterns = 2, nelts_per_pattern = 3. >> > >> > For pattern {0, 2, 4, ...} >> > Let, >> > a1 = 2 >> > S = step size = 2 >> > >> > Let Esel denote number of elements per pattern in sel at runtime. >> > Esel = (16 + 16x) / npatterns_sel >> > = (16 + 16x) / 2 >> > = (8 + 8x) >> > >> > So, last element of pattern: >> > ae = a1 + (Esel - 2) * S >> > = 2 + (8 + 8x - 2) * 2 >> > = 14 + 16x >> > >> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0 >> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0 >> > Since both are equal with quotient = 0, we select elements from op0. >> > >> > Since step size (S) is a multiple of npatterns(op0), we select >> > all elements from same pattern of op0. >> > >> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns)) >> > = max (2, max (2, 2) >> > = 2 >> > >> > res_nelts_per_pattern = max (op0_nelts_per_pattern, >> > max (op1_nelts_per_pattern, >> > >> > sel_nelts_per_pattern)) >> > = max (3, max (3, 3)) >> > = 3 >> > >> > So res has encoding with npatterns = 2, nelts_per_pattern = 3. >> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... } >> > >> > Unfortunately, this results in an issue for poly_int_cst index: >> > For example, >> > op0, op1: npatterns = 1, nelts_per_pattern = 3 >> > op0_len = op1_len = 4 + 4x >> > >> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1 >> > >> > In this case, >> > a1 = 5 + 4x >> > S = (6 + 4x) - (5 + 4x) = 1 >> > Esel = 4 + 4x >> > >> > ae = a1 + (esel - 2) * S >> > = (5 + 4x) + (4 + 4x - 2) * 1 >> > = 7 + 8x >> > >> > IIUC, 7 + 8x will always be index for last element of op1 ? >> > if x = 0, len = 4, 7 + 8x = 7 >> > if x = 1, len = 8, 7 + 8x = 15, etc. >> > So the stepped sequence will always choose elements >> > from op1 regardless of vector length for above case ? >> > >> > However, >> > ae /trunc op0_len >> > = (7 + 8x) / (4 + 4x) >> > which is not defined because 7/4 != 8/4 >> > and we return NULL_TREE, but I suppose the expected result would be: >> > res: { op1[0], op1[1], op1[2], ... } ? >> > >> > The patch passes bootstrap+test on aarch64-linux-gnu with and without sve, >> > and on x86_64-unknown-linux-gnu. >> > I would be grateful for suggestions on how to proceed. >> > >> > Thanks, >> > Prathamesh >> > >> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc >> > index a02ede79fed..8028b3e8e9a 100644 >> > --- a/gcc/fold-const.cc >> > +++ b/gcc/fold-const.cc >> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3. If not see >> > #include "vec-perm-indices.h" >> > #include "asan.h" >> > #include "gimple-range.h" >> > +#include <algorithm> >> > +#include "tree-pretty-print.h" >> > +#include "gimple-pretty-print.h" >> > +#include "print-tree.h" >> > >> > /* Nonzero if we are folding constants inside an initializer or a C++ >> > manifestly-constant-evaluated context; zero otherwise. >> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree >> > expr) >> > static bool >> > vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts) >> > { >> > - unsigned HOST_WIDE_INT i, nunits; >> > + unsigned HOST_WIDE_INT i; >> > >> > - if (TREE_CODE (arg) == VECTOR_CST >> > - && VECTOR_CST_NELTS (arg).is_constant (&nunits)) >> > - { >> > - for (i = 0; i < nunits; ++i) >> > - elts[i] = VECTOR_CST_ELT (arg, i); >> > - } >> > - else if (TREE_CODE (arg) == CONSTRUCTOR) >> > + if (TREE_CODE (arg) == CONSTRUCTOR) >> > { >> > constructor_elt *elt; >> > >> > @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int >> > nelts, tree *elts) >> > return true; >> > } >> > >> > +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding. */ >> > + >> > +static tree >> > +vector_cst_reshape (tree vec, unsigned npatterns, unsigned >> > nelts_per_pattern) >> > +{ >> > + gcc_assert (pow2p_hwi (npatterns)); >> > + >> > + if (VECTOR_CST_NPATTERNS (vec) == npatterns >> > + && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern) >> > + return vec; >> > + >> > + tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern); >> > + TREE_TYPE (v) = TREE_TYPE (vec); >> > + >> > + unsigned nelts = npatterns * nelts_per_pattern; >> > + for (unsigned i = 0; i < nelts; i++) >> > + VECTOR_CST_ENCODED_ELT(v, i) = vector_cst_elt (vec, i); >> > + return v; >> > +} >> > + >> > +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable >> > + operand for VLA vec_perm folding. If arg is VLS, then set >> > + NPATTERNS = nelts and NELTS_PER_PATTERN = 1. */ >> > + >> > +static tree >> > +valid_operand_for_fold_vec_perm_cst_p (tree arg) >> > +{ >> > + if (TREE_CODE (arg) != VECTOR_CST) >> > + return NULL_TREE; >> > + >> > + unsigned HOST_WIDE_INT nelts; >> > + unsigned npatterns, nelts_per_pattern; >> > + if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts)) >> > + { >> > + npatterns = nelts; >> > + nelts_per_pattern = 1; >> > + } >> > + else >> > + { >> > + npatterns = VECTOR_CST_NPATTERNS (arg); >> > + nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg); >> > + } >> > + >> > + if (!pow2p_hwi (npatterns)) >> > + return NULL_TREE; >> > + >> > + return vector_cst_reshape (arg, npatterns, nelts_per_pattern); >> > +} >> >> I don't think we should reshape the vectors for VLS, since it would >> create more nodes for GC to clean up later. Also, the "compact" encoding >> is canonical even for VLS, so the reshaping would effectively create >> noncanonical constants (even if only temporarily). > Removed in the attached patch. >> >> Instead, I think we should change the later: >> >> > + if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns, >> > + sel_nelts_per_pattern, reason, >> > verbose)) >> > + return NULL_TREE; >> >> so that it comes after the computation of res_npatterns and >> res_nelts_per_pattern. Then, if valid_mask_for_fold_vec_perm_cst_p >> returns false, and if the result type has a constant number of elements, >> we should: >> >> * set res_npatterns to that number of elements >> * set res_nelts_per_pattern to 1 >> * continue instead of returning null > Assuming we don't enforce only VLA or only VLS for input vectors and sel, > won't that be still an issue if res (and sel) is VLS, and input > vectors are VLA ? > For eg: > arg0, arg1 are type VNx4SI with npatterns = 2, nelts_per_pattern = 3, step = 2 > sel is V4SI constant with encoding { 0, 2, 4, ... } > and res_type is V4SI. > In this case, when it comes to index 4, the vector selection becomes > ambiguous, > since it can be arg1 for len = 4 + 4x, and arg0 for lengths > 4 + 4x ?
Ah, right. So the condition is whether the result type and the data input type have a constant number of elements, rather than just the result. > In the attached patch if sel is not a suitable mask, and input vectors > have constant length, then it sets: > res_npatterns = nelts of input vector > res_nelts_per_pattern = 1 > Does that look OK ? > This part of the code has few tests in test_mixed(). res_npatterns should be nelts of the result vector, not the input. >> The loop that follows will then do the correct thing for each element. >> >> The check for a power of 2 would then go in >> valid_mask_for_fold_vec_perm_cst_p rather than >> valid_operand_for_fold_vec_perm_cst_p. >> >> With that change, I think: >> >> > + >> > +/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable >> > + mask for VLA vec_perm folding. Set SEL_NPATTERNS and >> > SEL_NELTS_PER_PATTERN >> > + similarly. */ >> > + >> > +static bool >> > +valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1, >> > + const vec_perm_indices &sel, >> > + unsigned& sel_npatterns, >> > + unsigned& sel_nelts_per_pattern, >> > + char *reason = NULL, >> > + bool verbose = false) >> > +{ >> > + unsigned HOST_WIDE_INT nelts; >> > + if (sel.length ().is_constant (&nelts)) >> > + { >> > + sel_npatterns = nelts; >> > + sel_nelts_per_pattern = 1; >> > + } >> > + else >> > + { >> > + sel_npatterns = sel.encoding ().npatterns (); >> > + sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern (); >> > + } >> >> ...we should use the "else" code unconditionally. > Done. >> >> The function comment should describe "reason" and "verbose". It looks >> like these are debug parameters, is that right? > Yes, verbose is just for printf debugging, and reason is used in unit-tests > when > result is NULL_TREE, to verify that it's NULL for the intended reason, > and not due to something else. Could we make it a "const char **" instead, and just store the string pointers there? It looks like all the reasons are fixed strings rather than dynamic. >> >> > + >> > + if (!pow2p_hwi (sel_npatterns)) >> > + { >> > + if (reason) >> > + strcpy (reason, "sel_npatterns is not power of 2"); >> > + return false; >> > + } >> > + >> > + /* We want to avoid cases where sel.length is not a multiple of >> > npatterns. >> > + For eg: sel.length = 2 + 2x, and sel npatterns = 4. */ >> > + poly_uint64 esel; >> > + if (!multiple_p (sel.length (), sel_npatterns, &esel)) >> > + { >> > + if (reason) >> > + strcpy (reason, "sel.length is not multiple of sel_npatterns"); >> > + return false; >> > + } >> > + >> > + if (sel_nelts_per_pattern < 3) >> > + return true; >> > + >> > + for (unsigned pattern = 0; pattern < sel_npatterns; pattern++) >> > + { >> > + poly_uint64 a1 = sel[pattern + sel_npatterns]; >> > + poly_uint64 a2 = sel[pattern + 2 * sel_npatterns]; >> > + >> > + poly_uint64 step = a2 - a1; >> > + if (!step.is_constant ()) >> > + { >> > + if (reason) >> > + strcpy (reason, "step is not constant"); >> > + return false; >> > + } >> > + int S = step.to_constant (); >> > + if (S == 0) >> > + continue; >> >> Might be simpler as: >> >> HOST_WIDE_INT step; >> if (!(a2 - a1).is_constant (&step)) >> { >> ... >> } > This resulted in following error: > ../../gcc/gcc/fold-const.cc:10563:34: error: no matching function for > call to ‘poly_int<2, long unsigned int>::is_constant(long int*)’ > 10563 | if (!(a2 - a1).is_constant (&S)) > | ~~~~~~~~~~~~~~~~~~~~~~^~~~ OK: HOST_WIDE_INT step; if (!poly_int64 (a2 - a1).is_constant (&step)) > Could you please elaborate on the following issue, because it still > stands in this patch, > which is in Case 2 of test_fold_vec_perm_cst::test_stepped. > > For example, > op0, op1: npatterns = 1, nelts_per_pattern = 3 > op0_len = op1_len = 4 + 4x > > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1 > > In this case, > a1 = 5 + 4x > S = (6 + 4x) - (5 + 4x) = 1 > Esel = 4 + 4x > > ae = a1 + (esel - 2) * S > = (5 + 4x) + (4 + 4x - 2) * 1 > = 7 + 8x > > IIUC, 7 + 8x will always be index for last element of op1 ? > if x = 0, len = 4, 7 + 8x = 7 > if x = 1, len = 8, 7 + 8x = 15, etc. > So the stepped sequence will always choose elements > from op1 regardless of vector length for above case ? > > However, > ae /trunc op0_len > = (7 + 8x) / (4 + 4x) > which is not defined because 7/4 != 8/4 Ah, that was me being lazy and not handling a boundary case that seemed difficult to keep overflow-free. I've now pushed a fix. Thanks, Richard