Re: [PATCH] Fold truncations of left shifts in match.pd

2022-06-02 Thread Richard Biener via Gcc-patches
On Thu, Jun 2, 2022 at 12:55 PM Roger Sayle  wrote:
>
>
> Hi Richard,
> > +  /* RTL expansion knows how to expand rotates using shift/or.  */
> > +  if (icode == CODE_FOR_nothing
> > +  && (code == LROTATE_EXPR || code == RROTATE_EXPR)
> > +  && optab_handler (ior_optab, vec_mode) != CODE_FOR_nothing
> > +  && optab_handler (ashl_optab, vec_mode) != CODE_FOR_nothing)
> > +icode = (int) optab_handler (lshr_optab, vec_mode);
> >
> > but we then get the vector costing wrong.
>
> The issue is that we currently get the (relative) vector costing wrong.
> Currently for gcc.dg/vect/pr98674.c, the vectorizer thinks the scalar
> code requires two shifts and an ior, so believes it's profitable to vectorize
> this loop using two vector shifts and a vector ior.  But once match.pd
> simplifies the truncate and recognizes the HImode rotate we end up with:
>
> pr98674.c:6:16: note:   ==> examining statement: _6 = _1 r>> 8;
> pr98674.c:6:16: note:   vect_is_simple_use: vectype vector(8) short int
> pr98674.c:6:16: note:   vect_is_simple_use: operand 8, type of def: constant
> pr98674.c:6:16: missed:   op not supported by target.
> pr98674.c:8:33: missed:   not vectorized: relevant stmt not supported: _6 = _1 r>> 8;
> pr98674.c:6:16: missed:  bad operation or unsupported loop bound.
>
>
> Clearly, it's a win to vectorize HImode rotates when the backend can perform
> 8 (or 16) rotations at a time, albeit using 3 vector instructions, even when
> a scalar rotate can be performed in a single instruction.  Fundamentally,
> vectorization may still be desirable/profitable even when the backend
> doesn't provide an optab.

Yes, as said it's tree-vect-patterns.cc's job to handle this not natively
supported rotate by re-writing it.  Can you check why
vect_recog_rotate_pattern does not do this?  Ah, the code only handles
!TYPE_UNSIGNED (type) - not sure why though (for rotates it should not
matter, and for the lowered sequence we can convert to the desired
signedness to get arithmetic/logical shifts)?

> The current situation, where the i386 backend provides expanders to lower
> rotations (or vcond) into individual instruction sequences, also
> interferes with vector costing.  It's the vector cost function that needs
> to be fixed, not the generated code made worse (or the backend bloated
> performing its own RTL expansion workarounds).
>
> Is it instead ok to mark pr98674.c as XFAIL (a regression)?
> The tweak to tree-vect-stmts.cc was based on the assumption that we wished
> to continue vectorizing this loop.  Improving scalar code generation really
> shouldn't disable vectorization like this.

Yes, see above where the fix needs to be.  The pattern will then expose
the shift and ior to the vectorizer which then are properly costed.

Richard.

>
>
> Cheers,
> Roger
> --
>
>


RE: [PATCH] Fold truncations of left shifts in match.pd

2022-06-02 Thread Roger Sayle


Hi Richard,
> +  /* RTL expansion knows how to expand rotates using shift/or.  */
> +  if (icode == CODE_FOR_nothing
> +  && (code == LROTATE_EXPR || code == RROTATE_EXPR)
> +  && optab_handler (ior_optab, vec_mode) != CODE_FOR_nothing
> +  && optab_handler (ashl_optab, vec_mode) != CODE_FOR_nothing)
> +icode = (int) optab_handler (lshr_optab, vec_mode);
> 
> but we then get the vector costing wrong. 

The issue is that we currently get the (relative) vector costing wrong.
Currently for gcc.dg/vect/pr98674.c, the vectorizer thinks the scalar
code requires two shifts and an ior, so believes it's profitable to vectorize
this loop using two vector shifts and a vector ior.  But once match.pd
simplifies the truncate and recognizes the HImode rotate we end up with:

pr98674.c:6:16: note:   ==> examining statement: _6 = _1 r>> 8;
pr98674.c:6:16: note:   vect_is_simple_use: vectype vector(8) short int
pr98674.c:6:16: note:   vect_is_simple_use: operand 8, type of def: constant
pr98674.c:6:16: missed:   op not supported by target.
pr98674.c:8:33: missed:   not vectorized: relevant stmt not supported: _6 = _1 r>> 8;
pr98674.c:6:16: missed:  bad operation or unsupported loop bound.


Clearly, it's a win to vectorize HImode rotates when the backend can perform
8 (or 16) rotations at a time, albeit using 3 vector instructions, even when
a scalar rotate can be performed in a single instruction.  Fundamentally,
vectorization may still be desirable/profitable even when the backend
doesn't provide an optab.
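
For concreteness, the loop that now fails to vectorize is of roughly this
shape (a hypothetical reconstruction, not the actual pr98674.c source):

void
rotate_all (unsigned short *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = (a[i] >> 8) | (a[i] << 8);  /* now recognized as a[i] r>> 8 */
}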

The current situation, where the i386 backend provides expanders to lower
rotations (or vcond) into individual instruction sequences, also interferes
with vector costing.  It's the vector cost function that needs to be fixed,
not the generated code made worse (or the backend bloated performing its
own RTL expansion workarounds).

Is it instead ok to mark pr98674.c as XFAIL (a regression)?
The tweak to tree-vect-stmts.cc was based on the assumption that we wished
to continue vectorizing this loop.  Improving scalar code generation really
shouldn't disable vectorization like this.


Cheers,
Roger
--




Re: [PATCH] Fold truncations of left shifts in match.pd

2022-06-02 Thread Richard Biener via Gcc-patches
On Mon, May 30, 2022 at 2:24 PM Roger Sayle  wrote:
>
>
> Whilst investigating PR 55278, I noticed that the tree-ssa optimizers
> aren't eliminating the promotions of shifts to "int" as inserted by the
> c-family front-ends, instead leaving this simplification to
> the RTL optimizers.  This patch allows match.pd to do this itself earlier,
> narrowing (T)(X << C) to (T)X << C when the constant C is known to be
> valid for the (narrower) type T.
>
> Hence for this simple test case:
> short foo(short x) { return x << 5; }
>
> the .optimized dump currently looks like:
>
> short int foo (short int x)
> {
>   int _1;
>   int _2;
>   short int _4;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = (int) x_3(D);
>   _2 = _1 << 5;
>   _4 = (short int) _2;
>   return _4;
> }
>
> but with this patch, now becomes:
>
> short int foo (short int x)
> {
>   short int _2;
>
>   <bb 2> [local count: 1073741824]:
>   _2 = x_1(D) << 5;
>   return _2;
> }
>
> This is always reasonable as RTL expansion knows how to use
> widening optabs if it makes sense at the RTL level to perform
> this shift in a wider mode.
>
> Of course, there's often a catch.  The above simplification not only
> reduces the number of statements in gimple, but also allows further
> optimizations, for example including the recognition of rotate idioms
> and bswap16.  Alas, optimizing things earlier than anticipated
> requires several testsuite changes [though all these tests have
> been confirmed to generate identical assembly code on x86_64].
> The only significant change is that the vectorization pass previously
> wouldn't vectorize rotations if the backend doesn't explicitly provide
> an optab for them.  This is curious as, if the rotate is expressed as
> ior(lshift,rshift), it will vectorize, and likewise RTL expansion will
> generate the iorv(lshiftv,rshiftv) sequence if required for a vector
> mode rotation.  Hence this patch includes a tweak to the optabs
> test in tree-vect-stmts.cc's vectorizable_shifts to better reflect
> the functionality supported by RTL expansion.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  Ok for mainline?

Can the lshift pattern be merged with the preceding one?  It looks
awfully similar.  Possibly even do

  (if (wi::ltu_p ())
   (lshift (convert @1) (convert @2))
   { build_zero_cst (type); })

for when the truncation leaves us with zero?
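
Something along these lines, perhaps (an untested sketch of the merged
pattern, not a finished patch):

(simplify
 (convert (lshift:s@0 @1 INTEGER_CST@2))
 (if (INTEGRAL_TYPE_P (type)
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@0)))
  (if (wi::ltu_p (wi::to_wide (@2), TYPE_PRECISION (type)))
   (lshift (convert @1) (convert @2))
   /* A shift count of type's precision or more leaves only zeros in
      the bits that survive the truncation.  */
   { build_zero_cst (type); })))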

+  /* RTL expansion knows how to expand rotates using shift/or.  */
+  if (icode == CODE_FOR_nothing
+  && (code == LROTATE_EXPR || code == RROTATE_EXPR)
+  && optab_handler (ior_optab, vec_mode) != CODE_FOR_nothing
+  && optab_handler (ashl_optab, vec_mode) != CODE_FOR_nothing)
+icode = (int) optab_handler (lshr_optab, vec_mode);

but we then get the vector costing wrong.  Also note that vector lowering
will figure the rotate is not supported and do its own "lowering" using
IOR.  Also it seems that it only handles the case of vector by scalar (aka
uniform vector) rotates, and otherwise will expand to scalar operations.
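
I.e. of these two (hypothetical examples), only the first, uniform-count
form would be caught by that lowering; the second would go scalar:

void rot_uniform (unsigned short *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = (a[i] >> 3) | (a[i] << 13);              /* all lanes rotate by 3 */
}

void rot_variable (unsigned short *a, unsigned char *c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = (a[i] >> c[i]) | (a[i] << (16 - c[i]));  /* per-element count */
}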

That said, the appropriate way to deal with this is in tree-vect-patterns.cc
where there already is vect_recog_rotate_pattern that should be detected
so the above hunk shouldn't be necessary - instead eventually the
pattern recognition routine needs improving?

Thanks,
Richard.


>
> 2022-05-30  Roger Sayle  
>
> gcc/ChangeLog
> * match.pd (convert (lshift @1 INTEGER_CST@2)): Narrow integer
> left shifts by a constant when the result is truncated, and the
> shift constant is well-defined for the narrower mode.
> * tree-vect-stmts.cc (vectorizable_shift): Rotations by
> constants are vectorizable, if the backend supports logical
> shifts and IOR logical operations in the required vector mode.
>
> gcc/testsuite/ChangeLog
> * gcc.dg/fold-convlshift-4.c: New test case.
> * gcc.dg/optimize-bswaphi-1.c: Update found bswap count.
> * gcc.dg/tree-ssa/pr61839_3.c: Shift is now optimized before VRP.
> * gcc.dg/vect/vect-over-widen-1-big-array.c: Remove obsolete tests.
> * gcc.dg/vect/vect-over-widen-1.c: Likewise.
> * gcc.dg/vect/vect-over-widen-3-big-array.c: Likewise.
> * gcc.dg/vect/vect-over-widen-3.c: Likewise.
> * gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
> * gcc.dg/vect/vect-over-widen-4.c: Likewise.
>
>
> Thanks in advance,
> Roger
> --
>


Re: [PATCH] Fold truncations of left shifts in match.pd

2022-06-01 Thread Jeff Law via Gcc-patches




On 5/30/2022 6:23 AM, Roger Sayle wrote:

Whilst investigating PR 55278, I noticed that the tree-ssa optimizers
aren't eliminating the promotions of shifts to "int" as inserted by the
c-family front-ends, instead leaving this simplification to
the RTL optimizers.  This patch allows match.pd to do this itself earlier,
narrowing (T)(X << C) to (T)X << C when the constant C is known to be
valid for the (narrower) type T.

Hence for this simple test case:
short foo(short x) { return x << 5; }

the .optimized dump currently looks like:

short int foo (short int x)
{
   int _1;
   int _2;
   short int _4;

   <bb 2> [local count: 1073741824]:
   _1 = (int) x_3(D);
   _2 = _1 << 5;
   _4 = (short int) _2;
   return _4;
}

but with this patch, now becomes:

short int foo (short int x)
{
   short int _2;

   <bb 2> [local count: 1073741824]:
   _2 = x_1(D) << 5;
   return _2;
}

This is always reasonable as RTL expansion knows how to use
widening optabs if it makes sense at the RTL level to perform
this shift in a wider mode.

Of course, there's often a catch.  The above simplification not only
reduces the number of statements in gimple, but also allows further
optimizations, for example including the recognition of rotate idioms
and bswap16.  Alas, optimizing things earlier than anticipated
requires several testsuite changes [though all these tests have
been confirmed to generate identical assembly code on x86_64].
The only significant change is that the vectorization pass previously
wouldn't vectorize rotations if the backend doesn't explicitly provide
an optab for them.  This is curious as, if the rotate is expressed as
ior(lshift,rshift), it will vectorize, and likewise RTL expansion will
generate the iorv(lshiftv,rshiftv) sequence if required for a vector
mode rotation.  Hence this patch includes a tweak to the optabs
test in tree-vect-stmts.cc's vectorizable_shifts to better reflect
the functionality supported by RTL expansion.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-05-30  Roger Sayle  

gcc/ChangeLog
 * match.pd (convert (lshift @1 INTEGER_CST@2)): Narrow integer
 left shifts by a constant when the result is truncated, and the
 shift constant is well-defined for the narrower mode.
 * tree-vect-stmts.cc (vectorizable_shift): Rotations by
 constants are vectorizable, if the backend supports logical
 shifts and IOR logical operations in the required vector mode.

gcc/testsuite/ChangeLog
 * gcc.dg/fold-convlshift-4.c: New test case.
 * gcc.dg/optimize-bswaphi-1.c: Update found bswap count.
 * gcc.dg/tree-ssa/pr61839_3.c: Shift is now optimized before VRP.
 * gcc.dg/vect/vect-over-widen-1-big-array.c: Remove obsolete tests.
 * gcc.dg/vect/vect-over-widen-1.c: Likewise.
 * gcc.dg/vect/vect-over-widen-3-big-array.c: Likewise.
 * gcc.dg/vect/vect-over-widen-3.c: Likewise.
 * gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
 * gcc.dg/vect/vect-over-widen-4.c: Likewise.
So the worry here would be stuff like narrowing the source operand 
leading to partial stalls.  But as you indicated, if the target really 
wants to do the shift in a wider mode, it can.  Furthermore, the place 
to make that decision is at the gimple->rtl border, IMHO.


OK.

jeff

ps.  There may still be another old BZ for the lack of narrowing 
inhibiting vectorization IIRC.  I don't recall the specifics enough to 
hazard a guess if this patch will help or not.


[PATCH] Fold truncations of left shifts in match.pd

2022-05-30 Thread Roger Sayle

Whilst investigating PR 55278, I noticed that the tree-ssa optimizers
aren't eliminating the promotions of shifts to "int" as inserted by the
c-family front-ends, instead leaving this simplification to
the RTL optimizers.  This patch allows match.pd to do this itself earlier,
narrowing (T)(X << C) to (T)X << C when the constant C is known to be
valid for the (narrower) type T.

Hence for this simple test case:
short foo(short x) { return x << 5; }

the .optimized dump currently looks like:

short int foo (short int x)
{
  int _1;
  int _2;
  short int _4;

   <bb 2> [local count: 1073741824]:
  _1 = (int) x_3(D);
  _2 = _1 << 5;
  _4 = (short int) _2;
  return _4;
}

but with this patch, now becomes:

short int foo (short int x)
{
  short int _2;

   <bb 2> [local count: 1073741824]:
  _2 = x_1(D) << 5;
  return _2;
}

This is always reasonable as RTL expansion knows how to use
widening optabs if it makes sense at the RTL level to perform
this shift in a wider mode.

Of course, there's often a catch.  The above simplification not only
reduces the number of statements in gimple, but also allows further
optimizations, for example including the recognition of rotate idioms
and bswap16.  Alas, optimizing things earlier than anticipated
requires several testsuite changes [though all these tests have
been confirmed to generate identical assembly code on x86_64].
The only significant change is that the vectorization pass previously
wouldn't vectorize rotations if the backend doesn't explicitly provide
an optab for them.  This is curious as, if the rotate is expressed as
ior(lshift,rshift), it will vectorize, and likewise RTL expansion will
generate the iorv(lshiftv,rshiftv) sequence if required for a vector
mode rotation.  Hence this patch includes a tweak to the optabs
test in tree-vect-stmts.cc's vectorizable_shifts to better reflect
the functionality supported by RTL expansion.
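
As an illustration (a hypothetical example, not one of the testsuite
changes), a 16-bit byte-swap of this shape is now caught at the GIMPLE
level rather than at RTL:

unsigned short
swap16 (unsigned short x)
{
  /* With the narrowing in place both shifts stay in HImode, so the
     bswap pass sees the idiom earlier.  */
  return (unsigned short) ((x << 8) | (x >> 8));
}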

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures.  Ok for mainline?


2022-05-30  Roger Sayle  

gcc/ChangeLog
* match.pd (convert (lshift @1 INTEGER_CST@2)): Narrow integer
left shifts by a constant when the result is truncated, and the
shift constant is well-defined for the narrower mode.
* tree-vect-stmts.cc (vectorizable_shift): Rotations by
constants are vectorizable, if the backend supports logical
shifts and IOR logical operations in the required vector mode.

gcc/testsuite/ChangeLog
* gcc.dg/fold-convlshift-4.c: New test case.
* gcc.dg/optimize-bswaphi-1.c: Update found bswap count.
* gcc.dg/tree-ssa/pr61839_3.c: Shift is now optimized before VRP.
* gcc.dg/vect/vect-over-widen-1-big-array.c: Remove obsolete tests.
* gcc.dg/vect/vect-over-widen-1.c: Likewise.
* gcc.dg/vect/vect-over-widen-3-big-array.c: Likewise.
* gcc.dg/vect/vect-over-widen-3.c: Likewise.
* gcc.dg/vect/vect-over-widen-4-big-array.c: Likewise.
* gcc.dg/vect/vect-over-widen-4.c: Likewise.


Thanks in advance,
Roger
--

diff --git a/gcc/match.pd b/gcc/match.pd
index c2fed9b..05b34d6 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3637,6 +3637,16 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && !integer_zerop (@3))
   (lshift (convert @2) @3)))
 
+/* Narrow a lshift by constant.  */
+(simplify
+ (convert (lshift:s@0 @1 INTEGER_CST@2))
+ (if (INTEGRAL_TYPE_P (type)
+  && INTEGRAL_TYPE_P (TREE_TYPE (@0))
+  && TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@0))
+  && !integer_zerop (@2)
+  && wi::ltu_p (wi::to_wide (@2), TYPE_PRECISION (type)))
+  (lshift (convert @1) (convert @2))))
+
 /* Simplifications of conversions.  */
 
 /* Basic strip-useless-type-conversions / strip_nops.  */
diff --git a/gcc/testsuite/gcc.dg/fold-convlshift-4.c b/gcc/testsuite/gcc.dg/fold-convlshift-4.c
new file mode 100644
index 000..001627f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/fold-convlshift-4.c
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+short foo(short x)
+{
+  return x << 5;
+}
+
+/* { dg-final { scan-tree-dump-not "\\(int\\)" "optimized" } } */
+/* { dg-final { scan-tree-dump-not "\\(short int\\)" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
index d045da9..a5d8bfd 100644
--- a/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
+++ b/gcc/testsuite/gcc.dg/optimize-bswaphi-1.c
@@ -68,4 +68,4 @@ get_unaligned_16_be (unsigned char *p)
 
 
/* { dg-final { scan-tree-dump-times "16 bit load in target endianness found at" 4 "bswap" } } */
-/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 5 "bswap" } } */
+/* { dg-final { scan-tree-dump-times "16 bit bswap implementation found at" 4 "bswap" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr61839_3.c