On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> > <richard.sandif...@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> Sorry the slow review.  I clearly didn't think this through properly
> >> when doing the review of the original patch, so I wanted to spend
> >> some time working on the code to get a better understanding of
> >> the problem.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> >> > Hi,
> >> > For the following test-case:
> >> >
> >> > typedef float __attribute__((__vector_size__ (16))) F;
> >> > F foo (F a, F b)
> >> > {
> >> >   F v = (F) { 9 };
> >> >   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> >> > }
> >> >
> >> > Compiling with -O2 results in following ICE:
> >> > foo.c: In function ‘foo’:
> >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
> >> >     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> >> >       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
> >> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> >> > const&)
> >> >         ../../gcc/gcc/rtl.h:2314
> >> > 0x7f3185 wide_int_ref_storage<false,
> >> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
> >> >>(std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/wide-int.h:1089
> >> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
> >> >>::generic_wide_int<std::pair<rtx_def*, machine_mode>
> >> >>(std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/wide-int.h:847
> >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> >> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/poly-int.h:467
> >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> >> >>(std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/poly-int.h:453
> >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
> >> >         ../../gcc/gcc/rtl.h:2383
> >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
> >> >         ../../gcc/gcc/rtx-vector-builder.h:122
> >> > 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> >> > rtx_vector_builder>::elt(unsigned int) const
> >> >         ../../gcc/gcc/vector-builder.h:253
> >> > 0xfd4d11 rtx_vector_builder::build()
> >> >         ../../gcc/gcc/rtx-vector-builder.cc:73
> >> > 0xc21d9c const_vector_from_tree
> >> >         ../../gcc/gcc/expr.cc:13487
> >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >> > expand_modifier, rtx_def**, bool)
> >> >         ../../gcc/gcc/expr.cc:11059
> >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> >> >         ../../gcc/gcc/expr.h:310
> >> > 0xaee682 expand_return
> >> >         ../../gcc/gcc/cfgexpand.cc:3809
> >> > 0xaee682 expand_gimple_stmt_1
> >> >         ../../gcc/gcc/cfgexpand.cc:3918
> >> > 0xaee682 expand_gimple_stmt
> >> >         ../../gcc/gcc/cfgexpand.cc:4044
> >> > 0xaf28f0 expand_gimple_basic_block
> >> >         ../../gcc/gcc/cfgexpand.cc:6100
> >> > 0xaf4996 execute
> >> >         ../../gcc/gcc/cfgexpand.cc:6835
> >> >
> >> > IIUC, the issue is that fold_vec_perm returns a vector having float 
> >> > element
> >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries
> >> > to derive element v[3], not present in the encoding, while trying to
> >> > build rtx vector
> >> > in rtx_vector_builder::build():
> >> >  for (unsigned int i = 0; i < nelts; ++i)
> >> >     RTVEC_ELT (v, i) = elt (i);
> >> >
> >> > The attached patch tries to fix this by returning false from
> >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> >> > input vector has non-integral element type, so for VLA vectors, it
> >> > will only build result with dup sequence (nelts_per_pattern < 3) for
> >> > non-integral element type.
> >> >
> >> > For VLS vectors, this will still work for stepped sequence since it
> >> > will then use the "VLS exception" in fold_vec_perm_cst, and set:
> >> > res_npattern = res_nelts and
> >> > res_nelts_per_pattern = 1
> >> >
> >> > and fold the above case to:
> >> > F foo (F a, F b)
> >> > {
> >> >   <bb 2> [local count: 1073741824]:
> >> >   return { 0.0, 9.0e+0, 0.0, 0.0 };
> >> > }
> >> >
> >> > But I am not sure if this is entirely correct, since:
> >> > tree res = out_elts.build ();
> >> > will canonicalize the encoding and may result in a stepped sequence
> >> > (vector_builder::finalize() may reduce npatterns at the cost of 
> >> > increasing
> >> > nelts_per_pattern)  ?
> >> >
> >> > PS: This issue is now latent after PR111648 fix, since
> >> > valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> >> > false because the corresponding pattern in arg0 is not a natural
> >> > stepped sequence, and folds correctly using VLS exception. However, I
> >> > guess the underlying issue of dealing with non-integral element types
> >> > in fold_vec_perm_cst still remains ?
> >> >
> >> > The patch passes bootstrap+test with and without SVE on 
> >> > aarch64-linux-gnu,
> >> > and on x86_64-linux-gnu.
> >>
> >> I think the problem is instead in the way that we're calculating
> >> res_npatterns and res_nelts_per_pattern.
> >>
> >> If the selector is a duplication of { a1, ..., an }, then the
> >> result will be a duplication of n elements, regardless of the shape
> >> of the other arguments.
> >>
> >> Similarly, if the selector is { a1, ...., an } followed by a
> >> duplication of { b1, ..., bn }, the result be n elements followed
> >> by a duplication of n elements, regardless of the shape of the other
> >> arguments.
> >>
> >> So for these two cases, res_npatterns and res_nelts_per_pattern
> >> can come directly from the selector's encoding.
> >>
> >> If:
> >>
> >> (1) the selector is an n-pattern stepped sequence
> >> (2) the stepped part of each pattern selects from the same input pattern
> >> (3) the stepped part of each pattern does not select the first element
> >>     of the input pattern, or the full input pattern is stepped
> >>     (your previous patch)
> >>
> >> then the result is stepped only if one of the inputs is stepped.
> >> This is because, if an input pattern has 1 or 2 elements, (3) means
> >> that each element of the stepped sequence will select the same value,
> >> as if the selector step had been 0.
> > Hi Richard,
> > Thanks for the suggestions! I agree when selector is dup of {a1, ... an, 
> > ...} or
> > base elements followed up dup {a1, .. an, b1, ... bn, ...} in that
> > case we can set
> > res_nelts_per_pattern from selector's encoding. However even if we provide
> > more nelts_per_pattern that necessary, I guess vector_builder::finalize() 
> > will
> > canonicalize it to the correct encoding for result ?
>
> Right.  The elements before finalize is called have to be correct,
> but they don't need to be canonical or minimal.
>
> After the change, we'll build no more elements than we did before.
>
> >> So I think the PR could be solved by something like the attached.
> >> Do you agree?  If so, could you base the patch on this instead?
> >>
> >> Only tested against the self-tests.
> >>
> >> Thanks,
> >> Richard
> >>
> >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> index 40767736389..00fce4945a7 100644
> >> --- a/gcc/fold-const.cc
> >> +++ b/gcc/fold-const.cc
> >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree 
> >> arg1, const vec_perm_indices &sel,
> >>    unsigned res_npatterns, res_nelts_per_pattern;
> >>    unsigned HOST_WIDE_INT res_nelts;
> >>
> >> -  /* (1) If SEL is a suitable mask as determined by
> >> -     valid_mask_for_fold_vec_perm_cst_p, then:
> >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> >> -     res_nelts_per_pattern = max of nelts_per_pattern between
> >> -                            ARG0, ARG1 and SEL.
> >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> >> -     res_npatterns = nelts in result vector.
> >> -     res_nelts_per_pattern = 1.
> >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as 
> >> before.  */
> >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> -    {
> >> -      res_npatterns
> >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> >> -                             sel.encoding ().npatterns ()));
> >> +  /* First try to implement the fold in a VLA-friendly way.
> >> +
> >> +     (1) If the selector is simply a duplication of N elements, the
> >> +        result is likewise a duplication of N elements.
> >> +
> >> +     (2) If the selector is N elements followed by a duplication
> >> +        of N elements, the result is too.
> >>
> >> -      res_nelts_per_pattern
> >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> >> -                             sel.encoding ().nelts_per_pattern ()));
> >> +     (3) If the selector is N elements followed by an interleaving
> >> +        of N linear series, the situation is more complex.
> >>
> >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> >> +        can handle this case.  If we can, then each of the N linear
> >> +        series either (a) selects the same element each time or
> >> +        (b) selects a linear series from one of the input patterns.
> >> +
> >> +        If (b) holds for one of the linear series, the result
> >> +        will contain a linear series, and so the result will have
> >> +        the same shape as the selector.  If (a) holds for all of
> >> +        the lienar series, the result will be the same as (2) above.
> >> +
> >> +        (b) can only hold if one of the inputs pattern has a
> >> +        stepped encoding.  */
> >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> +    {
> >> +      res_npatterns = sel.encoding ().npatterns ();
> >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >> +      if (res_nelts_per_pattern == 3
> >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> >> +       res_nelts_per_pattern = 2;
> > Um, in this case, should we set:
> > res_nelts_per_pattern = max (nelts_per_pattern (arg0), 
> > nelts_per_pattern(arg1))
> > if both have nelts_per_pattern == 1 ?
>
> No, it still needs to be 2 even if arg0 and arg1 are duplicates.
> E.g. consider a selector that picks the first element of arg0
> followed by a duplicate of the first element of arg1.
>
> > Also I suppose this matters only for non-integral element type, since
> > for integral element type,
> > vector_cst_elt will return the correct value even if the element is
> > not explicitly encoded and input vector is dup ?
>
> Yeah, but it might help even for integers.  If we build fewer
> elements explicitly, and so read fewer implicitly-encoded inputs,
> there's less risk of running into:
>
>       if (!can_div_trunc_p (sel[i], len, &q, &r))
>         {
>           if (reason)
>             *reason = "cannot divide selector element by arg len";
>           return NULL_TREE;
>         }
Ah right, thanks for the clarification!
I am currently away on vacation and will return next Thursday, and
will post a follow up patch based on your patch.
Sorry for the delay.

Thanks,
Prathamesh
>
> Thanks,
> Richard

Reply via email to