Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

Richard Biener via Gcc-patches Wed, 12 Jul 2023 23:39:42 -0700

On Wed, 12 Jul 2023, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > The PRs ask for optimizing of
> >
> >   _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
> >   result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
> >
> > to a vector permutation.  The following implements this as
> > match.pd pattern, improving code generation on x86_64.
> >
> > On the RTL level we face the issue that backend patterns inconsistently
> > use vec_merge and vec_select of vec_concat to represent permutes.
> 
> Yeah, the current RTL codes probably overlap a bit too much.
> 
> Maybe we should have a rule that a vec_merge with a constant
> third operand should be canonicalised to a vec_select?


But vec_merge always has a constant third operand:

@findex vec_merge
@item (vec_merge:@var{m} @var{vec1} @var{vec2} @var{items})
This describes a merge operation between two vectors.  The result is a 
vector
of mode @var{m}; its elements are selected from either @var{vec1} or
@var{vec2}.  Which elements are selected is described by @var{items}, 
which
is a bit mask represented by a @code{const_int}; a zero bit indicates the
corresponding element in the result vector is taken from @var{vec2} while
a set bit indicates it is taken from @var{vec1}.

the "advantage" of vec_merge over vec_concat + vec_select is
that you don't need the 2x wider vector mode, but that's the
only one.  I guess we could allow a mode-less (VOIDmode) vec_concat as
the first operand of a vec_select since that mode isn't really
used for anything.

That said, we could work around the issue by having combine
also try to match vec_merge when the vec_select + vec_concat
combination is a blend.  But I fear that doesn't resonate well
with Segher.

>  And maybe
> change the first operand of vec_select to be an rtvec, so that
> no separate vec_concat (and thus wider mode) is needed for two-input
> permutes?  Would be a lot of work though...
> 
> > I think using a (supported) permute is almost always better
> > than an extract plus insert, maybe excluding the case we extract
> > element zero and that's aliased to a register that can be used
> > directly for insertion (not sure how to query that).
> 
> Yeah, extraction of the low element (0 for LE, N-1 for BE) is special
> in RTL, in that it is now folded to a subreg.  But IMO it's reasonable
> for even that case to through TARGET_VECTORIZE_VEC_PERM_CONST,
> maybe with a target-independent helper function to match permute
> vectors that are equivalent to extract-and-insert.
> 
> On AArch64, extract-and-insert is a single operation for other
> elements too, e.g.:
> 
>       ins     v0.s[2], v1.s[1]
> 
> is a thing.  But if the helper returns the index of the extracted
> elements, targets can decide for themselves whether the index is
> supported or not.
> 
> Agree that this is the right thing for gimple to do FWIW.

I think so as well.  Btw, I think only proper re-association and
merging will handle a full sequence of select, merge and permute
optimally.  In principle we have the bswap pass facility for this.

Richard.

> Thanks,
> Richard
> 
> > But this regresses for example gcc.target/i386/pr54855-8.c because PRE
> > now realizes that
> >
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 3>; [50.00%]
> >   else
> >     goto <bb 4>; [50.00%]
> >
> >   <bb 3> [local count: 536870913]:
> >
> >   <bb 4> [local count: 1073741824]:
> >   # iftmp.0_2 = PHI <_1(3), a_4(D)(2)>
> >   x_5 = BIT_INSERT_EXPR <x_3(D), iftmp.0_2, 0>;
> >
> > is equal to
> >
> >   <bb 2> [local count: 1073741824]:
> >   _1 = BIT_FIELD_REF <x_3(D), 64, 0>;
> >   if (_1 > a_4(D))
> >     goto <bb 4>; [50.00%]
> >   else
> >     goto <bb 3>; [50.00%]
> >
> >   <bb 3> [local count: 536870912]:
> >   _7 = BIT_INSERT_EXPR <x_3(D), a_4(D), 0>;
> >
> >   <bb 4> [local count: 1073741824]:
> >   # prephitmp_8 = PHI <x_3(D)(2), _7(3)>
> >
> > and that no longer produces the desired maxsd operation at the RTL
> > level (we fail to match .FMAX at the GIMPLE level earlier).
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu with regressions:
> >
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-times vmaxsh[ \\\\t] 1
> > FAIL: gcc.target/i386/pr54855-13.c scan-assembler-not vcomish[ \\\\t]
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-times maxsd 1
> > FAIL: gcc.target/i386/pr54855-8.c scan-assembler-not movsd
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-times minss 1
> > FAIL: gcc.target/i386/pr54855-9.c scan-assembler-not movss
> >
> > I think this is also PR88540 (the lack of min/max detection, not
> > sure if the SSE min/max are suitable here)
> >
> >     PR tree-optimization/94864
> >     PR tree-optimization/94865
> >     * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
> >     for vector insertion from vector extraction.
> >
> >     * gcc.target/i386/pr94864.c: New testcase.
> >     * gcc.target/i386/pr94865.c: Likewise.
> > ---
> >  gcc/match.pd                            | 25 +++++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94864.c | 13 +++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr94865.c | 13 +++++++++++++
> >  3 files changed, 51 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94864.c
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr94865.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 8543f777a28..8cc106049c4 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -7770,6 +7770,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >                   wi::to_wide (@ipos) + isize))
> >      (BIT_FIELD_REF @0 @rsize @rpos)))))
> >  
> > +/* Simplify vector inserts of other vector extracts to a permute.  */
> > +(simplify
> > + (bit_insert @0 (BIT_FIELD_REF@2 @1 @rsize @rpos) @ipos)
> > + (if (VECTOR_TYPE_P (type)
> > +      && types_match (@0, @1)
> > +      && types_match (TREE_TYPE (TREE_TYPE (@0)), TREE_TYPE (@2))
> > +      && TYPE_VECTOR_SUBPARTS (type).is_constant ())
> > +  (with
> > +   {
> > +     unsigned HOST_WIDE_INT elsz
> > +       = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (TREE_TYPE (@1))));
> > +     poly_uint64 relt = exact_div (tree_to_poly_uint64 (@rpos), elsz);
> > +     poly_uint64 ielt = exact_div (tree_to_poly_uint64 (@ipos), elsz);
> > +     unsigned nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > +     vec_perm_builder builder;
> > +     builder.new_vector (nunits, nunits, 1);
> > +     for (unsigned i = 0; i < nunits; ++i)
> > +       builder.quick_push (known_eq (ielt, i) ? nunits + relt : i);
> > +     vec_perm_indices sel (builder, 2, nunits);
> > +   }
> > +   (if (!VECTOR_MODE_P (TYPE_MODE (type))
> > +   || can_vec_perm_const_p (TYPE_MODE (type), TYPE_MODE (type), sel, 
> > false))
> > +    (vec_perm @0 @1 { vec_perm_indices_to_tree
> > +                        (build_vector_type (ssizetype, nunits), sel); 
> > })))))
> > +
> >  (if (canonicalize_math_after_vectorization_p ())
> >   (for fmas (FMA)
> >    (simplify
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94864.c 
> > b/gcc/testsuite/gcc.target/i386/pr94864.c
> > new file mode 100644
> > index 00000000000..69cb481fcfe
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94864.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[0] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "unpckhpd\[\\t \]%xmm0, %xmm1" } } */
> > diff --git a/gcc/testsuite/gcc.target/i386/pr94865.c 
> > b/gcc/testsuite/gcc.target/i386/pr94865.c
> > new file mode 100644
> > index 00000000000..84065ac2467
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr94865.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -msse2 -mno-avx" } */
> > +
> > +typedef double v2df __attribute__((vector_size(16)));
> > +
> > +v2df move_sd(v2df a, v2df b)
> > +{
> > +    v2df result = a;
> > +    result[1] = b[1];
> > +    return result;
> > +}
> > +
> > +/* { dg-final { scan-assembler "shufpd\[\\t \]*.2, %xmm1, %xmm0" } } */
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)

Re: [PATCH] tree-optimization/94864 - vector insert of vector extract simplification

Reply via email to