Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-03 Thread Richard Biener via Gcc-patches
On Wed, 2 Aug 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> > [...]
> >> >> in vect_determine_precisions_from_range.  Maybe we should drop
> >> >> the shift handling from there and instead rely on
> >> >> vect_determine_precisions_from_users, extending:
> >> >> 
> >> >> if (TREE_CODE (shift) != INTEGER_CST
> >> >> || !wi::ltu_p (wi::to_widest (shift), precision))
> >> >>   return;
> >> >> 
> >> >> to handle ranges where the max is known to be < precision.
> >> >> 
> >> >> There again, if masking is enough for right shifts and right rotates,
> >> >> maybe we should keep the current handling for them (with your fix)
> >> >> and skip the types_compatible_p check for those cases.
> >> >
> >> > I think it should be enough for left-shifts as well?  If we lshift
> >> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
> >> > op0 will still make us use HImode.  I think we only ever get overly
> >> > conservative answers for left-shifts from this function?
> >> 
> >> But if we have:
> >> 
> >>   short x, y;
> >>   int z = (int) x << (int) y;
> >> 
> >> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
> >> whereas x << y would invoke UB and x << (y & 15) would be 1.
> >
> > True, but we start with the range of the LHS which in this case
> > would be of type 'int' and thus 1 << 16 and not zero.  You
> > might call that a failure of vect_determine_precisions_from_range
> > of course, since it makes it not exactly a forward propagation ...
> 
> Ah, right, sorry.  I should have done more checking.
> 
> > [...]
> >> > Originally I completely disabled shift support but that regressed
> >> > the over-widen testcases a lot which at least have widened shifts
> >> > by constants a lot.
> >> >
> >> > x86 has vector rotates only for AMD XOP (which is dead) plus
> >> > some for V1TImode AFAICS, but I think we pattern-match rotates
> >> > to shifts, so maybe the precision stuff is interesting for the
> >> > case where we match the pattern rotate sequence for widenings?
> >> >
> >> > So for the types_compatible_p issue something along
> >> > the following?  We could also exempt the shift operand from
> >> > being covered by min_precision so the consumer would have
> >> > to make sure it can be represented (I think that's never going
> >> > to be an issue in practice until we get 256bit integers vectorized).
> >> > It will have to fixup the shift operands anyway.
> >> >
> >> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> >> > index e4ab8c2d65b..cdeeaf98a47 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
> >> > (stmt_vec_info stmt_info, gassign *stmt)
> >> >   }
> >> > else if (TREE_CODE (op) == SSA_NAME)
> >> >   {
> >> > -   /* Ignore codes that don't take uniform arguments.  */
> >> > -   if (!types_compatible_p (TREE_TYPE (op), type))
> >> > +   /* Ignore codes that don't take uniform arguments.  For shifts
> >> > +  the shift amount is known to be in-range.  */
> >> 
> >> I guess it's more "we can assume that the amount is in range"?
> >
> > Yes.
> >
> >> > +   if (code == LSHIFT_EXPR
> >> > +   || code == RSHIFT_EXPR
> >> > +   || code == LROTATE_EXPR
> >> > +   || code == RROTATE_EXPR)
> >> > + {
> >> > +   min_value = wi::min (min_value, 0, sign);
> >> > +   max_value = wi::max (max_value, TYPE_PRECISION (type), sign);
> >> 
> >> LGTM for shifts right.  Because of the above lshift thing, I think we
> >> need something like:
> >> 
> >>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
> >> {
> >>   wide_int op_min_value, op_max_value;
> >>   if (!vect_get_range_info (op, &op_min_value, &op_max_value))
> >> return;
> >> 
> >>   /* We can ignore left shifts by negative amounts, which are UB.  */
> >>   min_value = wi::min (min_value, 0, sign);
> >> 
> >>   /* Make sure the highest non-UB shift amount doesn't become UB.  */
> >>   op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
> >>   auto mask = wi::mask (TYPE_PRECISION (type), false,
> >>op_max_value.to_uhwi ());
> >>   max_value = wi::max (max_value, mask, sign);
> >> }
> >> 
> >> Does that look right?
> >
> > As said it looks overly conservative to me?  For example with my patch
> > for
> >
> > void foo (signed char *v, int s)
> > {
> >   if (s < 1 || s > 7)
> > return;
> >   for (int i = 0; i < 1024; ++i)
> > v[i] = v[i] << s;
> > }
> >
> > I get
> >
> > t.c:5:21: note:   _7 has range [0xc000, 0x3f80]
> > t.c:5:21: note:   can narrow to signed:15 without loss of precision: _7 = 
> > _6 << s_12(D);
> > t.c:5:21: note:   only the low 15 bits of _6 are significant
> > t.c:5:21: note:   _6 has range [0xff80, 0x7f]
> > ...
> > 

Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> [...]
>> >> in vect_determine_precisions_from_range.  Maybe we should drop
>> >> the shift handling from there and instead rely on
>> >> vect_determine_precisions_from_users, extending:
>> >> 
>> >>   if (TREE_CODE (shift) != INTEGER_CST
>> >>   || !wi::ltu_p (wi::to_widest (shift), precision))
>> >> return;
>> >> 
>> >> to handle ranges where the max is known to be < precision.
>> >> 
>> >> There again, if masking is enough for right shifts and right rotates,
>> >> maybe we should keep the current handling for them (with your fix)
>> >> and skip the types_compatible_p check for those cases.
>> >
>> > I think it should be enough for left-shifts as well?  If we lshift
>> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
>> > op0 will still make us use HImode.  I think we only ever get overly
>> > conservative answers for left-shifts from this function?
>> 
>> But if we have:
>> 
>>   short x, y;
>>   int z = (int) x << (int) y;
>> 
>> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
>> whereas x << y would invoke UB and x << (y & 15) would be 1.
>
> True, but we start with the range of the LHS which in this case
> would be of type 'int' and thus 1 << 16 and not zero.  You
> might call that a failure of vect_determine_precisions_from_range
> of course, since it makes it not exactly a forward propagation ...

Ah, right, sorry.  I should have done more checking.

> [...]
>> > Originally I completely disabled shift support but that regressed
>> > the over-widen testcases a lot which at least have widened shifts
>> > by constants a lot.
>> >
>> > x86 has vector rotates only for AMD XOP (which is dead) plus
>> > some for V1TImode AFAICS, but I think we pattern-match rotates
>> > to shifts, so maybe the precision stuff is interesting for the
>> > case where we match the pattern rotate sequence for widenings?
>> >
>> > So for the types_compatible_p issue something along
>> > the following?  We could also exempt the shift operand from
>> > being covered by min_precision so the consumer would have
>> > to make sure it can be represented (I think that's never going
>> > to be an issue in practice until we get 256bit integers vectorized).
>> > It will have to fixup the shift operands anyway.
>> >
>> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> > index e4ab8c2d65b..cdeeaf98a47 100644
>> > --- a/gcc/tree-vect-patterns.cc
>> > +++ b/gcc/tree-vect-patterns.cc
>> > @@ -6378,16 +6378,26 @@ vect_determine_precisions_from_range 
>> > (stmt_vec_info stmt_info, gassign *stmt)
>> >   }
>> > else if (TREE_CODE (op) == SSA_NAME)
>> >   {
>> > -   /* Ignore codes that don't take uniform arguments.  */
>> > -   if (!types_compatible_p (TREE_TYPE (op), type))
>> > +   /* Ignore codes that don't take uniform arguments.  For shifts
>> > +  the shift amount is known to be in-range.  */
>> 
>> I guess it's more "we can assume that the amount is in range"?
>
> Yes.
>
>> > +   if (code == LSHIFT_EXPR
>> > +   || code == RSHIFT_EXPR
>> > +   || code == LROTATE_EXPR
>> > +   || code == RROTATE_EXPR)
>> > + {
>> > +   min_value = wi::min (min_value, 0, sign);
>> > +   max_value = wi::max (max_value, TYPE_PRECISION (type), sign);
>> 
>> LGTM for shifts right.  Because of the above lshift thing, I think we
>> need something like:
>> 
>>   if (code == LSHIFT_EXPR || code == LROTATE_EXPR)
>> {
>>   wide_int op_min_value, op_max_value;
>>   if (!vect_get_range_info (op, &op_min_value, &op_max_value))
>> return;
>> 
>>   /* We can ignore left shifts by negative amounts, which are UB.  */
>>   min_value = wi::min (min_value, 0, sign);
>> 
>>   /* Make sure the highest non-UB shift amount doesn't become UB.  */
>>   op_max_value = wi::umin (op_max_value, TYPE_PRECISION (type));
>>   auto mask = wi::mask (TYPE_PRECISION (type), false,
>>  op_max_value.to_uhwi ());
>>   max_value = wi::max (max_value, mask, sign);
>> }
>> 
>> Does that look right?
>
> As said it looks overly conservative to me?  For example with my patch
> for
>
> void foo (signed char *v, int s)
> {
>   if (s < 1 || s > 7)
> return;
>   for (int i = 0; i < 1024; ++i)
> v[i] = v[i] << s;
> }
>
> I get
>
> t.c:5:21: note:   _7 has range [0xc000, 0x3f80]
> t.c:5:21: note:   can narrow to signed:15 without loss of precision: _7 = 
> _6 << s_12(D);
> t.c:5:21: note:   only the low 15 bits of _6 are significant
> t.c:5:21: note:   _6 has range [0xff80, 0x7f]
> ...
> t.c:5:21: note:   vect_recog_over_widening_pattern: detected: _7 = _6 << 
> s_12(D);
> t.c:5:21: note:   demoting int to signed short
> t.c:5:21: note:   Splitting statement: _6 = (int) _5;
> t.c:5:21: note:   into pattern statements: patt_24 = (signed short) _5;
> t.c:5:21: note:   and: patt_23 = (int) 

Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-02 Thread Richard Biener via Gcc-patches
On Wed, 2 Aug 2023, Richard Sandiford wrote:

> Richard Biener  writes:
> > On Tue, 1 Aug 2023, Richard Sandiford wrote:
> >
> >> Richard Sandiford  writes:
> >> > Richard Biener via Gcc-patches  writes:
> >> >> The following makes sure to limit the shift operand when vectorizing
> >> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> >> >> operand otherwise invokes undefined behavior.  When we determine
> >> >> whether we can demote the operand we know we at most shift in the
> >> >> sign bit so we can adjust the shift amount.
> >> >>
> >> >> Note this has the possibility of un-CSEing common shift operands
> >> >> as there's no good way to share pattern stmts between patterns.
> >> >> We'd have to separately pattern recognize the definition.
> >> >>
> >> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >> >>
> >> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
> >> >> the fallback optimistic zero for out-of-range shifts is at least
> >> >> "corrrect".  Not sure we ever try to demote rotates (probably not).
> >> >
> >> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
> >> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
> >
> > I meant "correct" for the constant folding that evaluates out-of-bound
> > shifts as zero.
> >
> >> Sorry for the multiple messages.  Wanted to get something out quickly
> >> because I wasn't sure how long it would take me to write this...
> >> 
> >> On rotates, for:
> >> 
> >> void
> >> foo (unsigned short *restrict ptr)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >> {
> >>   unsigned int x = ptr[i] & 0xff0;
> >>   ptr[i] = (x << 1) | (x >> 31);
> >> }
> >> }
> >> 
> >> we do get:
> >> 
> >> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
> >> 
> >> although aarch64 doesn't provide rrotate patterns, so nothing actually
> >> comes of it.
> >
> > I think it's still correct that we only need unsigned:13 for the input,
> > we know other bits are zero.  But of course when actually applying
> > this as documented
> >
> > /* Record that STMT_INFO could be changed from operating on TYPE to
> >operating on a type with the precision and sign given by PRECISION
> >and SIGN respectively.
> >
> > the operation itself has to be altered (the above doesn't suggest
> > promoting/demoting the operands to TYPE is the only thing to do).
> >
> > So it seems the burden is on the consumers of the information?
> 
> Yeah, textually that seems fair.  Not sure I was thinking of it in
> those terms at the time though. :)
> 
> >> I think the handling of variable shifts is flawed for other reasons.  
> >> Given:
> >> 
> >> void
> >> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> void
> >> us (unsigned short *restrict ptr1, short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> void
> >> su (short *restrict ptr1, unsigned short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> void
> >> ss (short *restrict ptr1, short *restrict ptr2)
> >> {
> >>   for (int i = 0; i < 200; ++i)
> >> ptr1[i] = ptr1[i] >> ptr2[i];
> >> }
> >> 
> >> we only narrow uu and ss, due to:
> >> 
> >>/* Ignore codes that don't take uniform arguments.  */
> >>if (!types_compatible_p (TREE_TYPE (op), type))
> >>  return;
> >
> > I suppose that's because we care about the shift operand at all here.
> > We could possibly use [0 .. precision-1] as known range for it
> > and only if that doesn't fit 'type' give up (and otherwise simply
> > ignore the input range of the shift operands here).
> >
> >> in vect_determine_precisions_from_range.  Maybe we should drop
> >> the shift handling from there and instead rely on
> >> vect_determine_precisions_from_users, extending:
> >> 
> >>if (TREE_CODE (shift) != INTEGER_CST
> >>|| !wi::ltu_p (wi::to_widest (shift), precision))
> >>  return;
> >> 
> >> to handle ranges where the max is known to be < precision.
> >> 
> >> There again, if masking is enough for right shifts and right rotates,
> >> maybe we should keep the current handling for them (with your fix)
> >> and skip the types_compatible_p check for those cases.
> >
> > I think it should be enough for left-shifts as well?  If we lshift
> > out like 0x100 << 9 so the lhs range is [0,0] the input range from
> > op0 will still make us use HImode.  I think we only ever get overly
> > conservative answers for left-shifts from this function?
> 
> But if we have:
> 
>   short x, y;
>   int z = (int) x << (int) y;
> 
> and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
> whereas x << y would invoke UB and x << (y & 15) would be 1.

True, but we start with the range of the LHS which in this case
would be of type 

Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-02 Thread Richard Sandiford via Gcc-patches
Richard Biener  writes:
> On Tue, 1 Aug 2023, Richard Sandiford wrote:
>
>> Richard Sandiford  writes:
>> > Richard Biener via Gcc-patches  writes:
>> >> The following makes sure to limit the shift operand when vectorizing
>> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
>> >> operand otherwise invokes undefined behavior.  When we determine
>> >> whether we can demote the operand we know we at most shift in the
>> >> sign bit so we can adjust the shift amount.
>> >>
>> >> Note this has the possibility of un-CSEing common shift operands
>> >> as there's no good way to share pattern stmts between patterns.
>> >> We'd have to separately pattern recognize the definition.
>> >>
>> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>> >>
>> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
>> >> the fallback optimistic zero for out-of-range shifts is at least
>> >> "corrrect".  Not sure we ever try to demote rotates (probably not).
>> >
>> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
>> > IMO the behaviour is equally wrong for LSHIFT_EXPR.
>
> I meant "correct" for the constant folding that evaluates out-of-bound
> shifts as zero.
>
>> Sorry for the multiple messages.  Wanted to get something out quickly
>> because I wasn't sure how long it would take me to write this...
>> 
>> On rotates, for:
>> 
>> void
>> foo (unsigned short *restrict ptr)
>> {
>>   for (int i = 0; i < 200; ++i)
>> {
>>   unsigned int x = ptr[i] & 0xff0;
>>   ptr[i] = (x << 1) | (x >> 31);
>> }
>> }
>> 
>> we do get:
>> 
>> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
>> 
>> although aarch64 doesn't provide rrotate patterns, so nothing actually
>> comes of it.
>
> I think it's still correct that we only need unsigned:13 for the input,
> we know other bits are zero.  But of course when actually applying
> this as documented
>
> /* Record that STMT_INFO could be changed from operating on TYPE to
>operating on a type with the precision and sign given by PRECISION
>and SIGN respectively.
>
> the operation itself has to be altered (the above doesn't suggest
> promoting/demoting the operands to TYPE is the only thing to do).
>
> So it seems the burden is on the consumers of the information?

Yeah, textually that seems fair.  Not sure I was thinking of it in
those terms at the time though. :)

>> I think the handling of variable shifts is flawed for other reasons.  Given:
>> 
>> void
>> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> us (unsigned short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> su (short *restrict ptr1, unsigned short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> void
>> ss (short *restrict ptr1, short *restrict ptr2)
>> {
>>   for (int i = 0; i < 200; ++i)
>> ptr1[i] = ptr1[i] >> ptr2[i];
>> }
>> 
>> we only narrow uu and ss, due to:
>> 
>>  /* Ignore codes that don't take uniform arguments.  */
>>  if (!types_compatible_p (TREE_TYPE (op), type))
>>return;
>
> I suppose that's because we care about the shift operand at all here.
> We could possibly use [0 .. precision-1] as known range for it
> and only if that doesn't fit 'type' give up (and otherwise simply
> ignore the input range of the shift operands here).
>
>> in vect_determine_precisions_from_range.  Maybe we should drop
>> the shift handling from there and instead rely on
>> vect_determine_precisions_from_users, extending:
>> 
>>  if (TREE_CODE (shift) != INTEGER_CST
>>  || !wi::ltu_p (wi::to_widest (shift), precision))
>>return;
>> 
>> to handle ranges where the max is known to be < precision.
>> 
>> There again, if masking is enough for right shifts and right rotates,
>> maybe we should keep the current handling for them (with your fix)
>> and skip the types_compatible_p check for those cases.
>
> I think it should be enough for left-shifts as well?  If we lshift
> out like 0x100 << 9 so the lhs range is [0,0] the input range from
> op0 will still make us use HImode.  I think we only ever get overly
> conservative answers for left-shifts from this function?

But if we have:

  short x, y;
  int z = (int) x << (int) y;

and at runtime, x == 1, y == 16, (short) z should be 0 (no UB),
whereas x << y would invoke UB and x << (y & 15) would be 1.
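A stand-alone sketch of that difference (plain C; the & 15 mirrors the
masking a demoted 16-bit shift would use, and the conversion of 0x10000
to short is the usual GCC behaviour):

  #include <stdio.h>

  int main (void)
  {
    short x = 1, y = 16;
    int z = (int) x << (int) y;             /* well-defined: 0x10000 */
    short truncated = (short) z;            /* 0 */
    short masked = (short) (x << (y & 15)); /* shift by 0: 1, differs */
    printf ("%d %d\n", truncated, masked);
    return 0;
  }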

> Whatever works for RROTATE should also work for LROTATE.

I think the same problem affects LROTATE.

>> So:
>> 
>> - restrict shift handling in vect_determine_precisions_from_range to
>>   RSHIFT_EXPR and RROTATE_EXPR
>> 
>> - remove types_compatible_p restriction for those cases
>> 
>> - extend vect_determine_precisions_from_users shift handling to check
>>   for ranges on the 

Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-02 Thread Richard Biener via Gcc-patches
On Tue, 1 Aug 2023, Richard Sandiford wrote:

> Richard Sandiford  writes:
> > Richard Biener via Gcc-patches  writes:
> >> The following makes sure to limit the shift operand when vectorizing
> >> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> >> operand otherwise invokes undefined behavior.  When we determine
> >> whether we can demote the operand we know we at most shift in the
> >> sign bit so we can adjust the shift amount.
> >>
> >> Note this has the possibility of un-CSEing common shift operands
> >> as there's no good way to share pattern stmts between patterns.
> >> We'd have to separately pattern recognize the definition.
> >>
> >> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >>
> >> Not sure about LSHIFT_EXPR, it probably has the same issue but
> >> the fallback optimistic zero for out-of-range shifts is at least
> >> "corrrect".  Not sure we ever try to demote rotates (probably not).
> >
> > I guess you mean "correct" for x86?  But that's just a quirk of x86.
> > IMO the behaviour is equally wrong for LSHIFT_EXPR.

I meant "correct" for the constant folding that evaluates out-of-bound
shifts as zero.

> Sorry for the multiple messages.  Wanted to get something out quickly
> because I wasn't sure how long it would take me to write this...
> 
> On rotates, for:
> 
> void
> foo (unsigned short *restrict ptr)
> {
>   for (int i = 0; i < 200; ++i)
> {
>   unsigned int x = ptr[i] & 0xff0;
>   ptr[i] = (x << 1) | (x >> 31);
> }
> }
> 
> we do get:
> 
> can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;
> 
> although aarch64 doesn't provide rrotate patterns, so nothing actually
> comes of it.

I think it's still correct that we only need unsigned:13 for the input,
we know other bits are zero.  But of course when actually applying
this as documented

/* Record that STMT_INFO could be changed from operating on TYPE to
   operating on a type with the precision and sign given by PRECISION
   and SIGN respectively.

the operation itself has to be altered (the above doesn't suggest
promoting/demoting the operands to TYPE is the only thing to do).

So it seems the burden is on the consumers of the information?

> I think the handling of variable shifts is flawed for other reasons.  Given:
> 
> void
> uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
> ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> void
> us (unsigned short *restrict ptr1, short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
> ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> void
> su (short *restrict ptr1, unsigned short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
> ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> void
> ss (short *restrict ptr1, short *restrict ptr2)
> {
>   for (int i = 0; i < 200; ++i)
> ptr1[i] = ptr1[i] >> ptr2[i];
> }
> 
> we only narrow uu and ss, due to:
> 
>   /* Ignore codes that don't take uniform arguments.  */
>   if (!types_compatible_p (TREE_TYPE (op), type))
> return;

I suppose that's because we care about the shift operand at all here.
We could possibly use [0 .. precision-1] as known range for it
and only if that doesn't fit 'type' give up (and otherwise simply
ignore the input range of the shift operands here).
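Concretely, something along these lines (a sketch only; the follow-up
patch does essentially this, with TYPE_PRECISION (type) as the bound):

  if (code == LSHIFT_EXPR || code == RSHIFT_EXPR
      || code == LROTATE_EXPR || code == RROTATE_EXPR)
    {
      /* Assume the shift amount is in [0, precision - 1] rather than
         using its recorded range.  */
      min_value = wi::min (min_value, 0, sign);
      max_value = wi::max (max_value, TYPE_PRECISION (type) - 1, sign);
    }
  else if (!types_compatible_p (TREE_TYPE (op), type))
    return;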

> in vect_determine_precisions_from_range.  Maybe we should drop
> the shift handling from there and instead rely on
> vect_determine_precisions_from_users, extending:
> 
>   if (TREE_CODE (shift) != INTEGER_CST
>   || !wi::ltu_p (wi::to_widest (shift), precision))
> return;
> 
> to handle ranges where the max is known to be < precision.
> 
> There again, if masking is enough for right shifts and right rotates,
> maybe we should keep the current handling for them (with your fix)
> and skip the types_compatible_p check for those cases.

I think it should be enough for left-shifts as well?  If we lshift
out like 0x100 << 9 so the lhs range is [0,0] the input range from
op0 will still make us use HImode.  I think we only ever get overly
conservative answers for left-shifts from this function?

Whatever works for RROTATE should also work for LROTATE.

> So:
> 
> - restrict shift handling in vect_determine_precisions_from_range to
>   RSHIFT_EXPR and RROTATE_EXPR
> 
> - remove types_compatible_p restriction for those cases
> 
> - extend vect_determine_precisions_from_users shift handling to check
>   for ranges on the shift amount
> 
> Does that sound right?

I'm not sure.   This all felt somewhat fragile when looking closer
(I was hoping you would eventually tackle it from the older
referenced bug) ... so my main idea was to perform incremental changes
where I have test coverage (as with the new wrong-code bug).

Originally I completely disabled shift support but that regressed
the over-widen testcases a lot which at least have widened shifts
by constants a lot.

x86 has vector rotates only for AMD XOP (which is 

Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-01 Thread Richard Sandiford via Gcc-patches
Richard Sandiford  writes:
> Richard Biener via Gcc-patches  writes:
>> The following makes sure to limit the shift operand when vectorizing
>> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
>> operand otherwise invokes undefined behavior.  When we determine
>> whether we can demote the operand we know we at most shift in the
>> sign bit so we can adjust the shift amount.
>>
>> Note this has the possibility of un-CSEing common shift operands
>> as there's no good way to share pattern stmts between patterns.
>> We'd have to separately pattern recognize the definition.
>>
>> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>>
>> Not sure about LSHIFT_EXPR, it probably has the same issue but
>> the fallback optimistic zero for out-of-range shifts is at least
>> "corrrect".  Not sure we ever try to demote rotates (probably not).
>
> I guess you mean "correct" for x86?  But that's just a quirk of x86.
> IMO the behaviour is equally wrong for LSHIFT_EXPR.

Sorry for the multiple messages.  Wanted to get something out quickly
because I wasn't sure how long it would take me to write this...

On rotates, for:

void
foo (unsigned short *restrict ptr)
{
  for (int i = 0; i < 200; ++i)
{
  unsigned int x = ptr[i] & 0xff0;
  ptr[i] = (x << 1) | (x >> 31);
}
}

we do get:

can narrow to unsigned:13 without loss of precision: _5 = x_12 r>> 31;

although aarch64 doesn't provide rrotate patterns, so nothing actually
comes of it.
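A quick brute-force check of that unsigned:13 claim, independent of the
vectorizer (plain C):

  #include <stdint.h>
  #include <assert.h>

  int main (void)
  {
    /* x = ptr[i] & 0xff0 is at most 0xff0 < 2^12, so x >> 31 is always
       0 and (x << 1) | (x >> 31) fits in 13 bits for every input.  */
    for (uint32_t v = 0; v <= 0xffff; ++v)
      {
        uint32_t x = v & 0xff0;
        assert (((x << 1) | (x >> 31)) < (1u << 13));
      }
    return 0;
  }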

I think the handling of variable shifts is flawed for other reasons.  Given:

void
uu (unsigned short *restrict ptr1, unsigned short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
ptr1[i] = ptr1[i] >> ptr2[i];
}

void
us (unsigned short *restrict ptr1, short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
ptr1[i] = ptr1[i] >> ptr2[i];
}

void
su (short *restrict ptr1, unsigned short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
ptr1[i] = ptr1[i] >> ptr2[i];
}

void
ss (short *restrict ptr1, short *restrict ptr2)
{
  for (int i = 0; i < 200; ++i)
ptr1[i] = ptr1[i] >> ptr2[i];
}

we only narrow uu and ss, due to:

/* Ignore codes that don't take uniform arguments.  */
if (!types_compatible_p (TREE_TYPE (op), type))
  return;

in vect_determine_precisions_from_range.  Maybe we should drop
the shift handling from there and instead rely on
vect_determine_precisions_from_users, extending:

if (TREE_CODE (shift) != INTEGER_CST
|| !wi::ltu_p (wi::to_widest (shift), precision))
  return;

to handle ranges where the max is known to be < precision.
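A rough sketch of that extension (untested; vect_get_range_info is the
same helper vect_determine_precisions_from_range already uses to query
operand ranges):

  if (TREE_CODE (shift) != INTEGER_CST)
    {
      wide_int shift_min, shift_max;
      /* Accept a variable shift amount whose range maximum is known
         to be below the precision.  */
      if (TREE_CODE (shift) != SSA_NAME
          || !vect_get_range_info (shift, &shift_min, &shift_max)
          || wi::geu_p (shift_max, precision))
        return;
      /* Any later use of the constant shift amount would then need to
         use the range maximum instead.  */
    }
  else if (!wi::ltu_p (wi::to_widest (shift), precision))
    return;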

There again, if masking is enough for right shifts and right rotates,
maybe we should keep the current handling for them (with your fix)
and skip the types_compatible_p check for those cases.

So:

- restrict shift handling in vect_determine_precisions_from_range to
  RSHIFT_EXPR and RROTATE_EXPR

- remove types_compatible_p restriction for those cases

- extend vect_determine_precisions_from_users shift handling to check
  for ranges on the shift amount

Does that sound right?

Thanks,
Richard


Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-08-01 Thread Richard Sandiford via Gcc-patches
Richard Biener via Gcc-patches  writes:
> The following makes sure to limit the shift operand when vectorizing
> (short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
> operand otherwise invokes undefined behavior.  When we determine
> whether we can demote the operand we know we at most shift in the
> sign bit so we can adjust the shift amount.
>
> Note this has the possibility of un-CSEing common shift operands
> as there's no good way to share pattern stmts between patterns.
> We'd have to separately pattern recognize the definition.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Not sure about LSHIFT_EXPR, it probably has the same issue but
> the fallback optimistic zero for out-of-range shifts is at least
> "corrrect".  Not sure we ever try to demote rotates (probably not).

I guess you mean "correct" for x86?  But that's just a quirk of x86.
IMO the behaviour is equally wrong for LSHIFT_EXPR.

Richard

> OK?
>
> Thanks,
> Richard.
>
>   PR tree-optimization/110838
>   * tree-vect-patterns.cc (vect_recog_over_widening_pattern):
>   Adjust the shift operand of RSHIFT_EXPRs.
>
>   * gcc.dg/torture/pr110838.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/torture/pr110838.c | 43 +
>  gcc/tree-vect-patterns.cc   | 24 ++
>  2 files changed, 67 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr110838.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr110838.c 
> b/gcc/testsuite/gcc.dg/torture/pr110838.c
> new file mode 100644
> index 000..f039bd6c8ea
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr110838.c
> @@ -0,0 +1,43 @@
> +/* { dg-do run } */
> +
> +typedef __UINT32_TYPE__ uint32_t;
> +typedef __UINT8_TYPE__ uint8_t;
> +typedef __INT8_TYPE__ int8_t;
> +typedef uint8_t pixel;
> +
> +/* get the sign of input variable (TODO: this is a dup, make common) */
> +static inline int8_t signOf(int x)
> +{
> +  return (x >> 31) | ((int)((((uint32_t)-x)) >> 31));
> +}
> +
> +__attribute__((noipa))
> +static void calSign_bug(int8_t *dst, const pixel *src1, const pixel *src2, 
> const int endX)
> +{
> +  for (int x = 0; x < endX; x++)
> +dst[x] = signOf(src1[x] - src2[x]);
> +}
> +
> +__attribute__((noipa, optimize(0)))
> +static void calSign_ok(int8_t *dst, const pixel *src1, const pixel *src2, 
> const int endX)
> +{
> +  for (int x = 0; x < endX; x++)
> +dst[x] = signOf(src1[x] - src2[x]);
> +}
> +
> +__attribute__((noipa, optimize(0)))
> +int main()
> +{
> +  const pixel s1[9] = { 0xcd, 0x33, 0xd4, 0x3e, 0xb0, 0xfb, 0x95, 0x64, 
> 0x70, };
> +  const pixel s2[9] = { 0xba, 0x9f, 0xab, 0xa1, 0x3b, 0x29, 0xb1, 0xbd, 
> 0x64, };
> +  int endX = 9;
> +  int8_t dst[9];
> +  int8_t dst_ok[9];
> +
> +  calSign_bug(dst, s1, s2, endX);
> +  calSign_ok(dst_ok, s1, s2, endX);
> +
> +  if (__builtin_memcmp(dst, dst_ok, endX) != 0)
> +__builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index ef806e2346e..e4ab8c2d65b 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -3099,9 +3099,33 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
>tree ops[3] = {};
>for (unsigned int i = 1; i < first_op; ++i)
>  ops[i - 1] = gimple_op (last_stmt, i);
> +  /* For right shifts limit the shift operand.  */
>vect_convert_inputs (vinfo, last_stmt_info, nops, &ops[first_op - 1],
>  op_type, &unprom[0], op_vectype);
>  
> +  /* Limit shift operands.  */
> +  if (code == RSHIFT_EXPR)
> +{
> +  wide_int min_value, max_value;
> +  if (TREE_CODE (ops[1]) == INTEGER_CST)
> + ops[1] = wide_int_to_tree (op_type,
> +wi::bit_and (wi::to_wide (ops[1]),
> + new_precision - 1));
> +  else if (!vect_get_range_info (ops[1], &min_value, &max_value)
> +|| wi::ge_p (max_value, new_precision, TYPE_SIGN (op_type)))
> + {
> +   /* ???  Note the following bad for SLP as that only supports
> +  same argument widened shifts and it un-CSEs same arguments.  */
> +   tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
> +   gimple *pattern_stmt
> + = gimple_build_assign (new_var, BIT_AND_EXPR, ops[1],
> +build_int_cst (op_type, new_precision - 1));
> +   ops[1] = new_var;
> +   gimple_set_location (pattern_stmt, gimple_location (last_stmt));
> +   append_pattern_def_seq (vinfo, last_stmt_info, pattern_stmt);
> + }
> +}
> +
>/* Use the operation to produce a result of type OP_TYPE.  */
>tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
>gimple *pattern_stmt = gimple_build_assign (new_var, code,


Re: [PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-07-31 Thread Jeff Law via Gcc-patches
On 7/31/23 08:01, Richard Biener via Gcc-patches wrote:

The following makes sure to limit the shift operand when vectorizing
(short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
operand otherwise invokes undefined behavior.  When we determine
whether we can demote the operand we know we at most shift in the
sign bit so we can adjust the shift amount.

Note this has the possibility of un-CSEing common shift operands
as there's no good way to share pattern stmts between patterns.
We'd have to separately pattern recognize the definition.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Not sure about LSHIFT_EXPR, it probably has the same issue but
the fallback optimistic zero for out-of-range shifts is at least
"corrrect".  Not sure we ever try to demote rotates (probably not).

OK?

Thanks,
Richard.

PR tree-optimization/110838
* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
Adjust the shift operand of RSHIFT_EXPRs.

* gcc.dg/torture/pr110838.c: New testcase.
I'm not a fan of the asymmetric handling across RSHIFT/LSHIFT.  But if 
you think the asymmetry isn't a problem in practice, then I won't object.


Jeff


[PATCH] tree-optimization/110838 - vectorization of widened shifts

2023-07-31 Thread Richard Biener via Gcc-patches
The following makes sure to limit the shift operand when vectorizing
(short)((int)x >> 31) via (short)x >> 31 as the out of bounds shift
operand otherwise invokes undefined behavior.  When we determine
whether we can demote the operand we know we at most shift in the
sign bit so we can adjust the shift amount.
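For the motivating case the equivalence is easy to see in plain C: the
value being shifted is a sign-extension of a 16-bit quantity, so any
arithmetic right shift amount >= 15 produces the same all-sign-bits
result, and masking the amount with new_precision - 1 is safe.  A
stand-alone sketch (not vectorizer code; the mask mirrors what the
pattern inserts):

  #include <stdint.h>

  /* The widened form the source uses: (short)((int)x >> 31).  */
  static int16_t wide_shift (int16_t x)
  {
    return (int16_t) ((int32_t) x >> 31);
  }

  /* The demoted form with the shift amount masked to the new
     precision: (short)(x >> (31 & 15)).  */
  static int16_t narrow_shift (int16_t x)
  {
    return (int16_t) (x >> (31 & 15));
  }

  int main (void)
  {
    int16_t tests[] = { -32768, -1, 0, 1, 32767 };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; ++i)
      if (wide_shift (tests[i]) != narrow_shift (tests[i]))
        __builtin_abort ();
    return 0;
  }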

Note this has the possibility of un-CSEing common shift operands
as there's no good way to share pattern stmts between patterns.
We'd have to separately pattern recognize the definition.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Not sure about LSHIFT_EXPR, it probably has the same issue but
the fallback optimistic zero for out-of-range shifts is at least
"corrrect".  Not sure we ever try to demote rotates (probably not).

OK?

Thanks,
Richard.

PR tree-optimization/110838
* tree-vect-patterns.cc (vect_recog_over_widening_pattern):
Adjust the shift operand of RSHIFT_EXPRs.

* gcc.dg/torture/pr110838.c: New testcase.
---
 gcc/testsuite/gcc.dg/torture/pr110838.c | 43 +
 gcc/tree-vect-patterns.cc   | 24 ++
 2 files changed, 67 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr110838.c

diff --git a/gcc/testsuite/gcc.dg/torture/pr110838.c 
b/gcc/testsuite/gcc.dg/torture/pr110838.c
new file mode 100644
index 000..f039bd6c8ea
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr110838.c
@@ -0,0 +1,43 @@
+/* { dg-do run } */
+
+typedef __UINT32_TYPE__ uint32_t;
+typedef __UINT8_TYPE__ uint8_t;
+typedef __INT8_TYPE__ int8_t;
+typedef uint8_t pixel;
+
+/* get the sign of input variable (TODO: this is a dup, make common) */
+static inline int8_t signOf(int x)
+{
+  return (x >> 31) | ((int)((((uint32_t)-x)) >> 31));
+}
+
+__attribute__((noipa))
+static void calSign_bug(int8_t *dst, const pixel *src1, const pixel *src2, 
const int endX)
+{
+  for (int x = 0; x < endX; x++)
+dst[x] = signOf(src1[x] - src2[x]);
+}
+
+__attribute__((noipa, optimize(0)))
+static void calSign_ok(int8_t *dst, const pixel *src1, const pixel *src2, 
const int endX)
+{
+  for (int x = 0; x < endX; x++)
+dst[x] = signOf(src1[x] - src2[x]);
+}
+
+__attribute__((noipa, optimize(0)))
+int main()
+{
+  const pixel s1[9] = { 0xcd, 0x33, 0xd4, 0x3e, 0xb0, 0xfb, 0x95, 0x64, 0x70, 
};
+  const pixel s2[9] = { 0xba, 0x9f, 0xab, 0xa1, 0x3b, 0x29, 0xb1, 0xbd, 0x64, 
};
+  int endX = 9;
+  int8_t dst[9];
+  int8_t dst_ok[9];
+
+  calSign_bug(dst, s1, s2, endX);
+  calSign_ok(dst_ok, s1, s2, endX);
+
+  if (__builtin_memcmp(dst, dst_ok, endX) != 0)
+__builtin_abort ();
+  return 0;
+}
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index ef806e2346e..e4ab8c2d65b 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -3099,9 +3099,33 @@ vect_recog_over_widening_pattern (vec_info *vinfo,
   tree ops[3] = {};
   for (unsigned int i = 1; i < first_op; ++i)
 ops[i - 1] = gimple_op (last_stmt, i);
+  /* For right shifts limit the shift operand.  */
   vect_convert_inputs (vinfo, last_stmt_info, nops, &ops[first_op - 1],
   op_type, &unprom[0], op_vectype);
 
+  /* Limit shift operands.  */
+  if (code == RSHIFT_EXPR)
+{
+  wide_int min_value, max_value;
+  if (TREE_CODE (ops[1]) == INTEGER_CST)
+   ops[1] = wide_int_to_tree (op_type,
+  wi::bit_and (wi::to_wide (ops[1]),
+   new_precision - 1));
+  else if (!vect_get_range_info (ops[1], &min_value, &max_value)
+  || wi::ge_p (max_value, new_precision, TYPE_SIGN (op_type)))
+   {
+ /* ???  Note the following bad for SLP as that only supports
+same argument widened shifts and it un-CSEs same arguments.  */
+ tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
+ gimple *pattern_stmt
+   = gimple_build_assign (new_var, BIT_AND_EXPR, ops[1],
+  build_int_cst (op_type, new_precision - 1));
+ ops[1] = new_var;
+ gimple_set_location (pattern_stmt, gimple_location (last_stmt));
+ append_pattern_def_seq (vinfo, last_stmt_info, pattern_stmt);
+   }
+}
+
   /* Use the operation to produce a result of type OP_TYPE.  */
   tree new_var = vect_recog_temp_ssa_var (op_type, NULL);
   gimple *pattern_stmt = gimple_build_assign (new_var, code,
-- 
2.35.3