Re: [RFC] ldist: Recognize rawmemchr loop patterns

2022-01-31 Thread Richard Biener via Gcc-patches
On Mon, Jan 31, 2022 at 2:16 PM Tom de Vries  wrote:
>
> On 9/17/21 10:08, Richard Biener via Gcc-patches wrote:
> > On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
> >  wrote:
> >>
> >> On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> >>> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> >>>  wrote:
> 
>  On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
>  [...]
> >>>
> >>> +  /* Handle strlen like loops.  */
> >>> +  if (store_dr == NULL
> >>> +  && integer_zerop (pattern)
> >>> +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> >>> +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> >>> +  && integer_onep (reduction_iv.step)
> >>> +  && (types_compatible_p (TREE_TYPE (reduction_var), 
> >>> size_type_node)
> >>> + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> >>> +{
> >>>
> >>> I wonder what goes wrong with a larger or smaller wrapping IV type?
> >>> The iteration
> >>> only stops when you load a NUL and the increments just wrap along 
> >>> (you're
> >>> using the pointer IVs to compute the strlen result).  Can't you 
> >>> simply truncate?
> >>
> >> I think truncation is enough as long as no overflow occurs in strlen or
> >> strlen_using_rawmemchr.
> >>
> >>> For larger than size_type_node (actually larger than ptr_type_node 
> >>> would matter
> >>> I guess), the argument is that since pointer wrapping would be 
> >>> undefined anyway
> >>> the IV cannot wrap either.  Now, the correct check here would IMHO be
> >>>
> >>>TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> >>> (ptr_type_node)
> >>> || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> >>>
> >>> ?
> >>
> >> Regarding the implementation which makes use of rawmemchr:
> >>
> >> We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> >> the maximal length we can determine of a string where each character 
> >> has
> >> size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> >> ptrdiff type is undefined we have to make sure that if an overflow
> >> occurs, then an overflow occurs for reduction variable, too, and that
> >> this is undefined, too.  However, I'm not sure anymore whether we want
> >> to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> >> equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> >> this would mean that a single string consumes more than half of the
> >> virtual addressable memory.  At least for architectures where
> >> TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is 
> >> reasonable
> >> to neglect the case where computing pointer difference may overflow.
> >> Otherwise we are talking about strings with lengths of multiple
> >> pebibytes.  For other architectures we might have to be more precise
> >> and make sure that reduction variable overflows first and that this is
> >> undefined.
> >>
> >> Thus a conservative condition would be (I assumed that the size of any
> >> integral type is a power of two, though I'm not sure this really
> >> holds;
> >> IIRC the C standard requires only that the alignment is a power of two
> >> but not necessarily the size so I might need to change this):
> >>
> >> /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - 
> >> log2 (sizeof (load_type)))
> >> or in other words return true if reduction variable overflows first
> >> and false otherwise.  */
> >>
> >> static bool
> >> reduction_var_overflows_first (tree reduction_var, tree load_type)
> >> {
> >>unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> >>unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> >> (reduction_var));
> >>unsigned size_exponent = wi::exact_log2 (wi::to_wide 
> >> (TYPE_SIZE_UNIT (load_type)));
> >>return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
> >> size_exponent);
> >> }
> >>
> >> TYPE_PRECISION (ptrdiff_type_node) == 64
> >> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>  && reduction_var_overflows_first (reduction_var, load_type))
> >>
> >> Regarding the implementation which makes use of strlen:
> >>
> >> I'm not sure what it means if strlen is called for a string with a
> >> length greater than SIZE_MAX.  Therefore, similar to the implementation
> >> using rawmemchr where we neglect the case of an overflow for 64bit
> >> architectures, a conservative condition would be:
> >>
> >> TYPE_PRECISION (size_type_node) == 64
> >> || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> >>  && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2022-01-31 Thread Tom de Vries via Gcc-patches

On 9/17/21 10:08, Richard Biener via Gcc-patches wrote:

On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
 wrote:


On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:

On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
 wrote:


On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
[...]


+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+  && integer_zerop (pattern)
+  && TREE_CODE (reduction_iv.base) == INTEGER_CST
+  && TREE_CODE (reduction_iv.step) == INTEGER_CST
+  && integer_onep (reduction_iv.step)
+  && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
+ || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
+{
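
For reference, a source loop of the shape this guard is after looks like the
following (an illustrative sketch; the function name is invented):

/* Strlen-like loop: the reduction variable starts at a constant, steps
   by 1 per iteration, there is no store, and the loop only exits on a
   zero character.  */
unsigned long
count_to_nul (const char *p)
{
  unsigned long n = 0;        /* reduction_iv: base 0, step 1 */
  for (; *p; ++p)             /* pattern: integer zero */
    ++n;
  return n;                   /* reduction_var */
}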

I wonder what goes wrong with a larger or smaller wrapping IV type?
The iteration
only stops when you load a NUL and the increments just wrap along (you're
using the pointer IVs to compute the strlen result).  Can't you simply truncate?


I think truncation is enough as long as no overflow occurs in strlen or
strlen_using_rawmemchr.
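
For reference, strlen_using_rawmemchr means computing the length via an
unbounded scan; a minimal sketch, assuming glibc's rawmemchr GNU extension
(the .RAWMEMCHR internal function models the same operation):

#define _GNU_SOURCE
#include <string.h>

/* strlen expressed through rawmemchr: find the NUL without a length
   bound and derive the length from the pointer difference.  */
static size_t
strlen_using_rawmemchr (const char *s)
{
  return (const char *) rawmemchr (s, '\0') - s;
}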


For larger than size_type_node (actually larger than ptr_type_node would matter
I guess), the argument is that since pointer wrapping would be undefined anyway
the IV cannot wrap either.  Now, the correct check here would IMHO be

   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
(ptr_type_node)
|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))

?


Regarding the implementation which makes use of rawmemchr:

We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
the maximal length we can determine of a string where each character has
size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
ptrdiff type is undefined we have to make sure that if an overflow
occurs, then an overflow occurs for reduction variable, too, and that
this is undefined, too.  However, I'm not sure anymore whether we want
to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
this would mean that a single string consumes more than half of the
virtual addressable memory.  At least for architectures where
TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
to neglect the case where computing pointer difference may overflow.
Otherwise we are talking about strings with lengths of multiple
pebibytes.  For other architectures we might have to be more precise
and make sure that reduction variable overflows first and that this is
undefined.

Thus a conservative condition would be (I assumed that the size of any
integral type is a power of two, though I'm not sure this really holds;
IIRC the C standard requires only that the alignment is a power of two
but not necessarily the size so I might need to change this):

/* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 
(sizeof (load_type)))
or in other words return true if reduction variable overflows first
and false otherwise.  */

static bool
reduction_var_overflows_first (tree reduction_var, tree load_type)
{
   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
(reduction_var));
   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT 
(load_type)));
   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
size_exponent);
}
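
As a standalone illustration of this inequality (plain C with assumed
precisions instead of the wide-int API; the names are made up):

#include <stdbool.h>
#include <stdio.h>

/* precision (reduction_var) < precision (ptrdiff_type) - 1 - log2 (sizeof (load_type)):
   the reduction variable is then guaranteed to overflow before the pointer
   difference can.  Assumes sizeof (load_type) is a power of two.  */
static bool
overflows_first (unsigned prec_reduction_var, unsigned prec_ptrdiff,
                 unsigned sizeof_load_type)
{
  unsigned size_exponent = __builtin_ctz (sizeof_load_type); /* log2 for powers of two */
  return prec_reduction_var < prec_ptrdiff - 1 - size_exponent;
}

int
main (void)
{
  /* With a 32-bit ptrdiff_t and a 4-byte load type the bound is 32 - 1 - 2 = 29.  */
  printf ("%d\n", overflows_first (16, 32, 4)); /* 1: a 16-bit counter overflows first */
  printf ("%d\n", overflows_first (32, 32, 4)); /* 0: a 32-bit counter does not */
  return 0;
}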

TYPE_PRECISION (ptrdiff_type_node) == 64
|| (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
 && reduction_var_overflows_first (reduction_var, load_type))

Regarding the implementation which makes use of strlen:

I'm not sure what it means if strlen is called for a string with a
length greater than SIZE_MAX.  Therefore, similar to the implementation
using rawmemchr where we neglect the case of an overflow for 64bit
architectures, a conservative condition would be:

TYPE_PRECISION (size_type_node) == 64
|| (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
 && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))

I still included the overflow undefined check for reduction variable in
order to rule out situations where the reduction variable is unsigned
and wraps around so many times that strlen(,_using_rawmemchr) overflows,
too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
architectures.  Anyhow, while writing this down it becomes clear that
this deserves a comment which I will add once it becomes clear which way
to go.
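
To make the unsigned wrap-around case concrete (a made-up example with a
deliberately narrow counter):

/* The 8-bit counter wraps every 256 iterations - well defined for
   unsigned types - so after the loop n holds the length modulo 256.
   Truncating the result of strlen reproduces exactly this value, but
   only as long as strlen itself cannot overflow, which is what the
   precision checks above are meant to guarantee.  */
unsigned char
count_u8 (const char *s)
{
  unsigned char n = 0;
  for (; *s; ++s)
    ++n;
  return n;
}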


I think all the arguments about objects bigger than half of the address-space
also are valid for 32bit targets and thus 32bit size_type_node (or
32bit pointer size).
I'm not actually sure what's the canonical type to check against, whether
it's size_type_node (C's size_t), ptr_type_node (C's void *) or sizetype (the
middle-end "offset" type used for all address computations).  For weird reasons
I'd lean towards 'sizetype' (for example some embedded targets have 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-10-11 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Fri, Sep 17, 2021 at 10:08:27AM +0200, Richard Biener wrote:
> On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> > > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> > >  wrote:
> > > >
> > > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > > > [...]
> > > > > > >
> > > > > > > +  /* Handle strlen like loops.  */
> > > > > > > +  if (store_dr == NULL
> > > > > > > +  && integer_zerop (pattern)
> > > > > > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > > > +  && integer_onep (reduction_iv.step)
> > > > > > > +  && (types_compatible_p (TREE_TYPE (reduction_var), 
> > > > > > > size_type_node)
> > > > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > > > > > +{
> > > > > > >
> > > > > > > I wonder what goes wrong with a larger or smaller wrapping IV 
> > > > > > > type?
> > > > > > > The iteration
> > > > > > > only stops when you load a NUL and the increments just wrap along 
> > > > > > > (you're
> > > > > > > using the pointer IVs to compute the strlen result).  Can't you 
> > > > > > > simply truncate?
> > > > > >
> > > > > > I think truncation is enough as long as no overflow occurs in 
> > > > > > strlen or
> > > > > > strlen_using_rawmemchr.
> > > > > >
> > > > > > > For larger than size_type_node (actually larger than 
> > > > > > > ptr_type_node would matter
> > > > > > > I guess), the argument is that since pointer wrapping would be 
> > > > > > > undefined anyway
> > > > > > > the IV cannot wrap either.  Now, the correct check here would 
> > > > > > > IMHO be
> > > > > > >
> > > > > > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > > > (ptr_type_node)
> > > > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > Regarding the implementation which makes use of rawmemchr:
> > > > > >
> > > > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  
> > > > > > Thus,
> > > > > > the maximal length we can determine of a string where each 
> > > > > > character has
> > > > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow 
> > > > > > for
> > > > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > > > occurs, then an overflow occurs for reduction variable, too, and 
> > > > > > that
> > > > > > this is undefined, too.  However, I'm not sure anymore whether we 
> > > > > > want
> > > > > > to respect overflows in all cases.  If TYPE_PRECISION 
> > > > > > (ptr_type_node)
> > > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, 
> > > > > > then
> > > > > > this would mean that a single string consumes more than half of the
> > > > > > virtual addressable memory.  At least for architectures where
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is 
> > > > > > reasonable
> > > > > > to neglect the case where computing pointer difference may overflow.
> > > > > > Otherwise we are talking about strings with lengths of multiple
> > > > > > pebibytes.  For other architectures we might have to be more precise
> > > > > > and make sure that reduction variable overflows first and that this 
> > > > > > is
> > > > > > undefined.
> > > > > >
> > > > > > Thus a conservative condition would be (I assumed that the size of 
> > > > > > any
> > > > > > integral type is a power of two, though I'm not sure this really
> > > > > > holds;
> > > > > > IIRC the C standard requires only that the alignment is a power of 
> > > > > > two
> > > > > > but not necessarily the size so I might need to change this):
> > > > > >
> > > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 
> > > > > > 1 - log2 (sizeof (load_type)))
> > > > > >or in other words return true if reduction variable overflows 
> > > > > > first
> > > > > >and false otherwise.  */
> > > > > >
> > > > > > static bool
> > > > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > > {
> > > > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > > > > > (reduction_var));
> > > > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide 
> > > > > > (TYPE_SIZE_UNIT (load_type)));
> > > > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 
> > > > > > - size_exponent);
> > > > > > }
> > > > > >
> > > > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > > && reduction_var_overflows_first (reduction_var, load_type))
> > > > > >
> > > > > > Regarding the implementation which makes use of strlen:
> > > > > >
> > > > > > I'm not sure what it means if strlen is called for a string with a
> > > > > > length greater than SIZE_MAX.

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-09-17 Thread Richard Biener via Gcc-patches
On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
 wrote:
>
> On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> > On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
> >  wrote:
> > >
> > > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > > [...]
> > > > > >
> > > > > > +  /* Handle strlen like loops.  */
> > > > > > +  if (store_dr == NULL
> > > > > > +  && integer_zerop (pattern)
> > > > > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > > +  && integer_onep (reduction_iv.step)
> > > > > > +  && (types_compatible_p (TREE_TYPE (reduction_var), 
> > > > > > size_type_node)
> > > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > > > > +{
> > > > > >
> > > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > > The iteration
> > > > > > only stops when you load a NUL and the increments just wrap along 
> > > > > > (you're
> > > > > > using the pointer IVs to compute the strlen result).  Can't you 
> > > > > > simply truncate?
> > > > >
> > > > > I think truncation is enough as long as no overflow occurs in strlen 
> > > > > or
> > > > > strlen_using_rawmemchr.
> > > > >
> > > > > > For larger than size_type_node (actually larger than ptr_type_node 
> > > > > > would matter
> > > > > > I guess), the argument is that since pointer wrapping would be 
> > > > > > undefined anyway
> > > > > > the IV cannot wrap either.  Now, the correct check here would IMHO 
> > > > > > be
> > > > > >
> > > > > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > > (ptr_type_node)
> > > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > > >
> > > > > > ?
> > > > >
> > > > > Regarding the implementation which makes use of rawmemchr:
> > > > >
> > > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  
> > > > > Thus,
> > > > > the maximal length we can determine of a string where each character 
> > > > > has
> > > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > > occurs, then an overflow occurs for reduction variable, too, and that
> > > > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > > this would mean that a single string consumes more than half of the
> > > > > virtual addressable memory.  At least for architectures where
> > > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is 
> > > > > reasonable
> > > > > to neglect the case where computing pointer difference may overflow.
> > > > > Otherwise we are talking about strings with lengths of multiple
> > > > > pebibytes.  For other architectures we might have to be more precise
> > > > > and make sure that reduction variable overflows first and that this is
> > > > > undefined.
> > > > >
> > > > > Thus a conservative condition would be (I assumed that the size of any
> > > > > integral type is a power of two, though I'm not sure this really
> > > > > holds;
> > > > > IIRC the C standard requires only that the alignment is a power of two
> > > > > but not necessarily the size so I might need to change this):
> > > > >
> > > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 
> > > > > - log2 (sizeof (load_type)))
> > > > >or in other words return true if reduction variable overflows first
> > > > >and false otherwise.  */
> > > > >
> > > > > static bool
> > > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > > {
> > > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > > > > (reduction_var));
> > > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide 
> > > > > (TYPE_SIZE_UNIT (load_type)));
> > > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
> > > > > size_exponent);
> > > > > }
> > > > >
> > > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > && reduction_var_overflows_first (reduction_var, load_type))
> > > > >
> > > > > Regarding the implementation which makes use of strlen:
> > > > >
> > > > > I'm not sure what it means if strlen is called for a string with a
> > > > > length greater than SIZE_MAX.  Therefore, similar to the 
> > > > > implementation
> > > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > > architectures, a conservative condition would be:
> > > > >
> > > > > TYPE_PRECISION (size_type_node) == 64
> > > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-09-13 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:
> On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> > [...]
> > > > >
> > > > > +  /* Handle strlen like loops.  */
> > > > > +  if (store_dr == NULL
> > > > > +  && integer_zerop (pattern)
> > > > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > > +  && integer_onep (reduction_iv.step)
> > > > > +  && (types_compatible_p (TREE_TYPE (reduction_var), 
> > > > > size_type_node)
> > > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > > > +{
> > > > >
> > > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > > The iteration
> > > > > only stops when you load a NUL and the increments just wrap along 
> > > > > (you're
> > > > > using the pointer IVs to compute the strlen result).  Can't you 
> > > > > simply truncate?
> > > >
> > > > I think truncation is enough as long as no overflow occurs in strlen or
> > > > strlen_using_rawmemchr.
> > > >
> > > > > For larger than size_type_node (actually larger than ptr_type_node 
> > > > > would matter
> > > > > I guess), the argument is that since pointer wrapping would be 
> > > > > undefined anyway
> > > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > > >
> > > > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > > (ptr_type_node)
> > > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > > >
> > > > > ?
> > > >
> > > > Regarding the implementation which makes use of rawmemchr:
> > > >
> > > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > > the maximal length we can determine of a string where each character has
> > > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > > ptrdiff type is undefined we have to make sure that if an overflow
> > > > occurs, then an overflow occurs for reduction variable, too, and that
> > > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > > this would mean that a single string consumes more than half of the
> > > > virtual addressable memory.  At least for architectures where
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > > to neglect the case where computing pointer difference may overflow.
> > > > Otherwise we are talking about strings with lengths of multiple
> > > > pebibytes.  For other architectures we might have to be more precise
> > > > and make sure that reduction variable overflows first and that this is
> > > > undefined.
> > > >
> > > > Thus a conservative condition would be (I assumed that the size of any
> > > > integral type is a power of two, though I'm not sure this really holds;
> > > > IIRC the C standard requires only that the alignment is a power of two
> > > > but not necessarily the size so I might need to change this):
> > > >
> > > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - 
> > > > log2 (sizeof (load_type)))
> > > >or in other words return true if reduction variable overflows first
> > > >and false otherwise.  */
> > > >
> > > > static bool
> > > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > > {
> > > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > > > (reduction_var));
> > > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT 
> > > > (load_type)));
> > > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
> > > > size_exponent);
> > > > }
> > > >
> > > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > && reduction_var_overflows_first (reduction_var, load_type))
> > > >
> > > > Regarding the implementation which makes use of strlen:
> > > >
> > > > I'm not sure what it means if strlen is called for a string with a
> > > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > > architectures, a conservative condition would be:
> > > >
> > > > TYPE_PRECISION (size_type_node) == 64
> > > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION 
> > > > (size_type_node))
> > > >
> > > > I still included the overflow undefined check for reduction variable in
> > > > order to rule out situations where the reduction variable is unsigned
> > > > and wraps around so many times that strlen(,_using_rawmemchr) overflows,
> > > > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > > > architectures.

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-09-06 Thread Richard Biener via Gcc-patches
On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
 wrote:
>
> On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
> [...]
> > > >
> > > > +  /* Handle strlen like loops.  */
> > > > +  if (store_dr == NULL
> > > > +  && integer_zerop (pattern)
> > > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > > +  && integer_onep (reduction_iv.step)
> > > > +  && (types_compatible_p (TREE_TYPE (reduction_var), 
> > > > size_type_node)
> > > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > > +{
> > > >
> > > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > > The iteration
> > > > only stops when you load a NUL and the increments just wrap along 
> > > > (you're
> > > > using the pointer IVs to compute the strlen result).  Can't you simply 
> > > > truncate?
> > >
> > > I think truncation is enough as long as no overflow occurs in strlen or
> > > strlen_using_rawmemchr.
> > >
> > > > For larger than size_type_node (actually larger than ptr_type_node 
> > > > would matter
> > > > I guess), the argument is that since pointer wrapping would be 
> > > > undefined anyway
> > > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > > >
> > > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > > (ptr_type_node)
> > > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > > >
> > > > ?
> > >
> > > Regarding the implementation which makes use of rawmemchr:
> > >
> > > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > > the maximal length we can determine of a string where each character has
> > > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > > ptrdiff type is undefined we have to make sure that if an overflow
> > > occurs, then an overflow occurs for reduction variable, too, and that
> > > this is undefined, too.  However, I'm not sure anymore whether we want
> > > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > > this would mean that a single string consumes more than half of the
> > > virtual addressable memory.  At least for architectures where
> > > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > > to neglect the case where computing pointer difference may overflow.
> > > Otherwise we are talking about strings with lengths of multiple
> > > pebibytes.  For other architectures we might have to be more precise
> > > and make sure that reduction variable overflows first and that this is
> > > undefined.
> > >
> > > Thus a conservative condition would be (I assumed that the size of any
> > > integral type is a power of two, though I'm not sure this really holds;
> > > IIRC the C standard requires only that the alignment is a power of two
> > > but not necessarily the size so I might need to change this):
> > >
> > > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - 
> > > log2 (sizeof (load_type)))
> > >or in other words return true if reduction variable overflows first
> > >and false otherwise.  */
> > >
> > > static bool
> > > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > > {
> > >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> > >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > > (reduction_var));
> > >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT 
> > > (load_type)));
> > >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
> > > size_exponent);
> > > }
> > >
> > > TYPE_PRECISION (ptrdiff_type_node) == 64
> > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > && reduction_var_overflows_first (reduction_var, load_type))
> > >
> > > Regarding the implementation which makes use of strlen:
> > >
> > > I'm not sure what it means if strlen is called for a string with a
> > > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > > using rawmemchr where we neglect the case of an overflow for 64bit
> > > architectures, a conservative condition would be:
> > >
> > > TYPE_PRECISION (size_type_node) == 64
> > > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> > >
> > > I still included the overflow undefined check for reduction variable in
> > > order to rule out situations where the reduction variable is unsigned
> > > and wraps around so many times that strlen(,_using_rawmemchr) overflows,
> > > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > > architectures.  Anyhow, while writing this down it becomes clear that
> > > this deserves a comment which I will add once it becomes clear which way
> > > to go.
> >
> > I think all the arguments about objects bigger than half of the
> > address-space also are valid for 32bit targets and thus 32bit
> > size_type_node (or 32bit pointer size).

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-09-03 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
[...]
> > >
> > > +  /* Handle strlen like loops.  */
> > > +  if (store_dr == NULL
> > > +  && integer_zerop (pattern)
> > > +  && TREE_CODE (reduction_iv.base) == INTEGER_CST
> > > +  && TREE_CODE (reduction_iv.step) == INTEGER_CST
> > > +  && integer_onep (reduction_iv.step)
> > > +  && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
> > > + || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var
> > > +{
> > >
> > > I wonder what goes wrong with a larger or smaller wrapping IV type?
> > > The iteration
> > > only stops when you load a NUL and the increments just wrap along (you're
> > > using the pointer IVs to compute the strlen result).  Can't you simply 
> > > truncate?
> >
> > I think truncation is enough as long as no overflow occurs in strlen or
> > strlen_using_rawmemchr.
> >
> > > For larger than size_type_node (actually larger than ptr_type_node would 
> > > matter
> > > I guess), the argument is that since pointer wrapping would be undefined 
> > > anyway
> > > the IV cannot wrap either.  Now, the correct check here would IMHO be
> > >
> > >   TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
> > > (ptr_type_node)
> > >|| TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))
> > >
> > > ?
> >
> > Regarding the implementation which makes use of rawmemchr:
> >
> > We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
> > the maximal length we can determine of a string where each character has
> > size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
> > ptrdiff type is undefined we have to make sure that if an overflow
> > occurs, then an overflow occurs for reduction variable, too, and that
> > this is undefined, too.  However, I'm not sure anymore whether we want
> > to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
> > equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
> > this would mean that a single string consumes more than half of the
> > virtual addressable memory.  At least for architectures where
> > TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
> > to neglect the case where computing pointer difference may overflow.
> > Otherwise we are talking about strings with lengths of multiple
> > pebibytes.  For other architectures we might have to be more precise
> > and make sure that reduction variable overflows first and that this is
> > undefined.
> >
> > Thus a conservative condition would be (I assumed that the size of any
> > integral type is a power of two, though I'm not sure this really holds;
> > IIRC the C standard requires only that the alignment is a power of two
> > but not necessarily the size so I might need to change this):
> >
> > /* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 
> > (sizeof (load_type)))
> >or in other words return true if reduction variable overflows first
> >and false otherwise.  */
> >
> > static bool
> > reduction_var_overflows_first (tree reduction_var, tree load_type)
> > {
> >   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
> >   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
> > (reduction_var));
> >   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT 
> > (load_type)));
> >   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
> > size_exponent);
> > }
> >
> > TYPE_PRECISION (ptrdiff_type_node) == 64
> > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > && reduction_var_overflows_first (reduction_var, load_type))
> >
> > Regarding the implementation which makes use of strlen:
> >
> > I'm not sure what it means if strlen is called for a string with a
> > length greater than SIZE_MAX.  Therefore, similar to the implementation
> > using rawmemchr where we neglect the case of an overflow for 64bit
> > architectures, a conservative condition would be:
> >
> > TYPE_PRECISION (size_type_node) == 64
> > || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
> > && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))
> >
> > I still included the overflow undefined check for reduction variable in
> > order to rule out situations where the reduction variable is unsigned
> > and wraps around so many times that strlen(,_using_rawmemchr) overflows,
> > too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
> > architectures.  Anyhow, while writing this down it becomes clear that
> > this deserves a comment which I will add once it becomes clear which way
> > to go.
> 
> I think all the arguments about objects bigger than half of the address-space
> also are valid for 32bit targets and thus 32bit size_type_node (or
> 32bit pointer size).
> I'm not actually sure what's the canonical type to check against, whether
> it's size_type_node (C's size_t), ptr_type_node (C's void *) or sizetype (the
> middle-end "offset" type used for all address computations).

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-08-20 Thread Richard Biener via Gcc-patches
On Fri, Jun 25, 2021 at 12:23 PM Stefan Schulze Frielinghaus
 wrote:
>
> On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
> >  wrote:
> > >
> > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus 
> > > wrote:
> > > [...]
> > > > > but we won't ever arrive here because of the niters condition.  But
> > > > > yes, doing the pattern matching in the innermost loop processing code
> > > > > looks good to me - for the specific case it would be
> > > > >
> > > > >   /* Don't distribute loop if niters is unknown.  */
> > > > >   tree niters = number_of_latch_executions (loop);
> > > > >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > > ---> here?
> > > > > continue;
> > > >
> > > > Right, please find attached a new version of the patch where everything
> > > > is included in the loop distribution pass.  I will do a bootstrap and
> > > > regtest on IBM Z over night.  If you give me green light I will also do
> > > > the same on x86_64.
> > >
> > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > > would rebase and run the testsuites again and post a patch series
> > > including the rawmemchr implementation for IBM Z.
> >
> > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> > *loop, vec *work_list)
> >return work_list->length () > 0;
> >  }
> >
> > +static void
> > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> > +   data_reference_p store_dr, tree base, tree 
> > pattern,
> > +   location_t loc)
> > +{
> >
> > this new function needs a comment.  Applies to all of the new ones, btw.
>
> Done.
>
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> > +  && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE 
> > (pattern));
> >
> > this looks fragile and is probably unnecessary as well.
> >
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> >
> > in general you want types_compatible_p () checks which for pointers means
> > all pointers are compatible ...
>
> True, I removed both asserts.
>
> > (skipping stuff)
> >
> > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
> >   && !optimize_loop_for_speed_p (loop)))
> > continue;
> >
> > -  /* Don't distribute loop if niters is unknown.  */
> > +  /* If niters is unknown don't distribute loop but rather try to 
> > transform
> > +it to a call to a builtin.  */
> >tree niters = number_of_latch_executions (loop);
> >if (niters == NULL_TREE || niters == chrec_dont_know)
> > -   continue;
> > +   {
> > + if (transform_reduction_loop (loop))
> > +   {
> > + changed = true;
> > + loops_to_be_destroyed.safe_push (loop);
> > + if (dump_file)
> > +   fprintf (dump_file, "Loop %d transformed into a
> > builtin.\n", loop->num);
> > +   }
> > + continue;
> > +   }
> >
> > please look at
> >
> >   if (nb_generated_loops + nb_generated_calls > 0)
> > {
> >   changed = true;
> >   if (dump_enabled_p ())
> > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >  loc, "Loop%s %d distributed: split to
> > %d loops "
> >  "and %d library calls.\n", str, loop->num,
> >  nb_generated_loops, nb_generated_calls);
> >
> > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> > transforms are reported with -fopt-info-loop
>
> Done.
>
> > +
> > +  return transform_reduction_loop_1 (loop, load_dr, store_dr, 
> > reduction_var);
> > +}
> >
> > what's the point in tail-calling here and visually splitting the
> > function in half?
>
> In the first place I thought that this is more pleasant since in
> transform_reduction_loop_1 it is settled that we have a single load,
> store, and reduction variable.  After refactoring this isn't true
> anymore and I inlined the function and made this clear via a comment.
>
> >
> > (sorry for picking random pieces now ;))
> >
> > +  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> > +  gsi_next (&bsi), ++ninsns)
> > +   {
> >
> > this counts debug insns, I guess you want gsi_next_nondebug at least.
> > not sure why you are counting PHIs at all btw - for the loops you match
> > you are expecting at most two, one IV and eventually one for the virtual
> > operand of the store?
>
> Yes, I removed the counting for the phi loop and changed to
> gsi_next_nondebug for both loops.
>
> >
> > + if (gimple_has_volatile_ops (phi))
> > +   return false;
> >
> > PHIs never have volatile ops.
> >
> > + if (gimple_clobber_p (phi))
> > +   continue;

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-08-06 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Fri, Jun 25, 2021 at 12:23:32PM +0200, Stefan Schulze Frielinghaus wrote:
> On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> > On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
> >  wrote:
> > >
> > > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus 
> > > wrote:
> > > [...]
> > > > > but we won't ever arrive here because of the niters condition.  But
> > > > > yes, doing the pattern matching in the innermost loop processing code
> > > > > looks good to me - for the specific case it would be
> > > > >
> > > > >   /* Don't distribute loop if niters is unknown.  */
> > > > >   tree niters = number_of_latch_executions (loop);
> > > > >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > > ---> here?
> > > > > continue;
> > > >
> > > > Right, please find attached a new version of the patch where everything
> > > > is included in the loop distribution pass.  I will do a bootstrap and
> > > > regtest on IBM Z over night.  If you give me green light I will also do
> > > > the same on x86_64.
> > >
> > > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > > would rebase and run the testsuites again and post a patch series
> > > including the rawmemchr implementation for IBM Z.
> > 
> > @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> > *loop, vec *work_list)
> >return work_list->length () > 0;
> >  }
> > 
> > +static void
> > +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> > +   data_reference_p store_dr, tree base, tree 
> > pattern,
> > +   location_t loc)
> > +{
> > 
> > this new function needs a comment.  Applies to all of the new ones, btw.
> 
> Done.
> 
> > +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> > +  && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE 
> > (pattern));
> > 
> > this looks fragile and is probably unnecessary as well.
> > 
> > +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> > 
> > in general you want types_compatible_p () checks which for pointers means
> > all pointers are compatible ...
> 
> True, I removed both asserts.
> 
> > (skipping stuff)
> > 
> > @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
> >   && !optimize_loop_for_speed_p (loop)))
> > continue;
> > 
> > -  /* Don't distribute loop if niters is unknown.  */
> > +  /* If niters is unknown don't distribute loop but rather try to 
> > transform
> > +it to a call to a builtin.  */
> >tree niters = number_of_latch_executions (loop);
> >if (niters == NULL_TREE || niters == chrec_dont_know)
> > -   continue;
> > +   {
> > + if (transform_reduction_loop (loop))
> > +   {
> > + changed = true;
> > + loops_to_be_destroyed.safe_push (loop);
> > + if (dump_file)
> > +   fprintf (dump_file, "Loop %d transformed into a
> > builtin.\n", loop->num);
> > +   }
> > + continue;
> > +   }
> > 
> > please look at
> > 
> >   if (nb_generated_loops + nb_generated_calls > 0)
> > {
> >   changed = true;
> >   if (dump_enabled_p ())
> > dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
> >  loc, "Loop%s %d distributed: split to
> > %d loops "
> >  "and %d library calls.\n", str, loop->num,
> >  nb_generated_loops, nb_generated_calls);
> > 
> > and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> > transforms are reported with -fopt-info-loop
> 
> Done.
> 
> > +
> > +  return transform_reduction_loop_1 (loop, load_dr, store_dr, 
> > reduction_var);
> > +}
> > 
> > what's the point in tail-calling here and visually splitting the
> > function in half?
> 
> In the first place I thought that this is more pleasant since in
> transform_reduction_loop_1 it is settled that we have a single load,
> store, and reduction variable.  After refactoring this isn't true
> anymore and I inlined the function and made this clear via a comment.
> 
> > 
> > (sorry for picking random pieces now ;))
> > 
> > +  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> > +  gsi_next (&bsi), ++ninsns)
> > +   {
> > 
> > this counts debug insns, I guess you want gsi_next_nondebug at least.
> > not sure why you are counting PHIs at all btw - for the loops you match
> > you are expecting at most two, one IV and eventually one for the virtual
> > operand of the store?
> 
> Yes, I removed the counting for the phi loop and changed to
> gsi_next_nondebug for both loops.
> 
> > 
> > + if (gimple_has_volatile_ops (phi))
> > +   return false;
> > 
> > PHIs never have volatile ops.
> > 
> > + if (gimple_clobber_p (phi))
> > +   continue;

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-06-25 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, Jun 16, 2021 at 04:22:35PM +0200, Richard Biener wrote:
> On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
> > [...]
> > > > but we won't ever arrive here because of the niters condition.  But
> > > > yes, doing the pattern matching in the innermost loop processing code
> > > > looks good to me - for the specific case it would be
> > > >
> > > >   /* Don't distribute loop if niters is unknown.  */
> > > >   tree niters = number_of_latch_executions (loop);
> > > >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > > > ---> here?
> > > > continue;
> > >
> > > Right, please find attached a new version of the patch where everything
> > > is included in the loop distribution pass.  I will do a bootstrap and
> > > regtest on IBM Z over night.  If you give me green light I will also do
> > > the same on x86_64.
> >
> > Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> > least the ldist-strlen testcase).  If you are Ok with the patch, then I
> > would rebase and run the testsuites again and post a patch series
> > including the rawmemchr implementation for IBM Z.
> 
> @@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
> *loop, vec *work_list)
>return work_list->length () > 0;
>  }
> 
> +static void
> +generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
> +   data_reference_p store_dr, tree base, tree 
> pattern,
> +   location_t loc)
> +{
> 
> this new function needs a comment.  Applies to all of the new ones, btw.

Done.

> +  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
> +  && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE 
> (pattern));
> 
> this looks fragile and is probably unnecessary as well.
> 
> +  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));
> 
> in general you want types_compatible_p () checks which for pointers means
> all pointers are compatible ...

True, I removed both asserts.

> (skipping stuff)
> 
> @@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
>   && !optimize_loop_for_speed_p (loop)))
> continue;
> 
> -  /* Don't distribute loop if niters is unknown.  */
> +  /* If niters is unknown don't distribute loop but rather try to 
> transform
> +it to a call to a builtin.  */
>tree niters = number_of_latch_executions (loop);
>if (niters == NULL_TREE || niters == chrec_dont_know)
> -   continue;
> +   {
> + if (transform_reduction_loop (loop))
> +   {
> + changed = true;
> + loops_to_be_destroyed.safe_push (loop);
> + if (dump_file)
> +   fprintf (dump_file, "Loop %d transformed into a
> builtin.\n", loop->num);
> +   }
> + continue;
> +   }
> 
> please look at
> 
>   if (nb_generated_loops + nb_generated_calls > 0)
> {
>   changed = true;
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
>  loc, "Loop%s %d distributed: split to
> %d loops "
>  "and %d library calls.\n", str, loop->num,
>  nb_generated_loops, nb_generated_calls);
> 
> and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
> transforms are reported with -fopt-info-loop

Done.

> +
> +  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
> +}
> 
> what's the point in tail-calling here and visually splitting the
> function in half?

In the first place I thought that this is more pleasant since in
transform_reduction_loop_1 it is settled that we have a single load,
store, and reduction variable.  After refactoring this isn't true
anymore and I inlined the function and made this clear via a comment.

> 
> (sorry for picking random pieces now ;))
> 
> +  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
> +  gsi_next (&bsi), ++ninsns)
> +   {
> 
> this counts debug insns, I guess you want gsi_next_nondebug at least.
> not sure why you are counting PHIs at all btw - for the loops you match
> you are expecting at most two, one IV and eventually one for the virtual
> operand of the store?

Yes, I removed the counting for the phi loop and changed to
gsi_next_nondebug for both loops.

> 
> + if (gimple_has_volatile_ops (phi))
> +   return false;
> 
> PHIs never have volatile ops.
> 
> + if (gimple_clobber_p (phi))
> +   continue;
> 
> or are clobbers.

Removed both.

> Btw, can you factor out a helper from find_single_drs working on a
> stmt to reduce code duplication?

Ahh sorry for that.  I've already done this in one of my first patches
but didn't copy that over.  Although my changes do not require a RDG the
whole pass is base

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-06-16 Thread Richard Biener via Gcc-patches
On Mon, Jun 14, 2021 at 7:26 PM Stefan Schulze Frielinghaus
 wrote:
>
> On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
> [...]
> > > but we won't ever arrive here because of the niters condition.  But
> > > yes, doing the pattern matching in the innermost loop processing code
> > > looks good to me - for the specific case it would be
> > >
> > >   /* Don't distribute loop if niters is unknown.  */
> > >   tree niters = number_of_latch_executions (loop);
> > >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > > ---> here?
> > > continue;
> >
> > Right, please find attached a new version of the patch where everything
> > is included in the loop distribution pass.  I will do a bootstrap and
> > regtest on IBM Z over night.  If you give me green light I will also do
> > the same on x86_64.
>
> Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
> least the ldist-strlen testcase).  If you are Ok with the patch, then I
> would rebase and run the testsuites again and post a patch series
> including the rawmemchr implementation for IBM Z.

@@ -3257,6 +3261,464 @@ find_seed_stmts_for_distribution (class loop
*loop, vec *work_list)
   return work_list->length () > 0;
 }

+static void
+generate_rawmemchr_builtin (loop_p loop, tree reduction_var,
+   data_reference_p store_dr, tree base, tree pattern,
+   location_t loc)
+{

this new function needs a comment.  Applies to all of the new ones, btw.

+  gcc_checking_assert (POINTER_TYPE_P (TREE_TYPE (base))
+  && TREE_TYPE (TREE_TYPE (base)) == TREE_TYPE (pattern));

this looks fragile and is probably unnecessary as well.

+  gcc_checking_assert (TREE_TYPE (reduction_var) == TREE_TYPE (base));

in general you want types_compatible_p () checks which for pointers means
all pointers are compatible ...
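
If such a check is kept at all, the middle-end form would be something along
these lines (a sketch of the suggested check, not the patch's code):

  /* types_compatible_p treats all pointer types as compatible, unlike
     comparing the type nodes for pointer equality.  */
  gcc_checking_assert (types_compatible_p (TREE_TYPE (reduction_var),
                                           TREE_TYPE (base)));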

(skipping stuff)

@@ -3321,10 +3783,20 @@ loop_distribution::execute (function *fun)
  && !optimize_loop_for_speed_p (loop)))
continue;

-  /* Don't distribute loop if niters is unknown.  */
+  /* If niters is unknown don't distribute loop but rather try to transform
+it to a call to a builtin.  */
   tree niters = number_of_latch_executions (loop);
   if (niters == NULL_TREE || niters == chrec_dont_know)
-   continue;
+   {
+ if (transform_reduction_loop (loop))
+   {
+ changed = true;
+ loops_to_be_destroyed.safe_push (loop);
+ if (dump_file)
+   fprintf (dump_file, "Loop %d transformed into a
builtin.\n", loop->num);
+   }
+ continue;
+   }

please look at

  if (nb_generated_loops + nb_generated_calls > 0)
{
  changed = true;
  if (dump_enabled_p ())
dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
 loc, "Loop%s %d distributed: split to
%d loops "
 "and %d library calls.\n", str, loop->num,
 nb_generated_loops, nb_generated_calls);

and follow the use of dump_* and MSG_OPTIMIZED_LOCATIONS so the
transforms are reported with -fopt-info-loop

+
+  return transform_reduction_loop_1 (loop, load_dr, store_dr, reduction_var);
+}

what's the point in tail-calling here and visually splitting the
function in half?

(sorry for picking random pieces now ;))

+  for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi);
+  gsi_next (&bsi), ++ninsns)
+   {

this counts debug insns, I guess you want gsi_next_nondebug at least.
not sure why you are counting PHIs at all btw - for the loops you match
you are expecting at most two, one IV and eventually one for the virtual
operand of the store?

+ if (gimple_has_volatile_ops (phi))
+   return false;

PHIs never have volatile ops.

+ if (gimple_clobber_p (phi))
+   continue;

or are clobbers.

Btw, can you factor out a helper from find_single_drs working on a
stmt to reduce code duplication?

+  tree reduction_var;
+  switch (gimple_code (reduction_stmt))
+{
+case GIMPLE_PHI:
+  reduction_var = gimple_phi_result (reduction_stmt);
+  break;
+case GIMPLE_ASSIGN:
+  reduction_var = gimple_assign_lhs (reduction_stmt);
+  break;
+default:
+  /* Bail out e.g. for GIMPLE_CALL.  */
+  return false;

gimple_get_lhs (reduction_stmt); would work for both PHIs
and assigns.

+  if (reduction_var == NULL)
+return false;

it can never be NULL here.
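
That is, the switch above collapses to something like (a sketch; note that
gimple_get_lhs also returns the lhs of a GIMPLE_CALL, so calls need an
explicit bail-out if they must still be rejected):

  tree reduction_var = gimple_get_lhs (reduction_stmt);
  /* NULL for codes without a lhs, e.g. GIMPLE_COND.  */
  if (reduction_var == NULL_TREE || is_gimple_call (reduction_stmt))
    return false;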

+  /* Bail out if this is a bitfield memory reference.  */
+  if (TREE_CODE (DR_REF (load_dr)) == COMPONENT_REF
+  && DECL_BIT_FIELD (TREE_OPERAND (DR_REF (load_dr), 1)))
+return false;
...

I see this is again quite some code copied from find_single_drs, please
see how to avoid this much duplication by splitting out helpers.

+static bool
+transform_reduction_loop_1 (loop_p loop,
+ 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-06-14 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Thu, May 20, 2021 at 08:37:24PM +0200, Stefan Schulze Frielinghaus wrote:
[...]
> > but we won't ever arrive here because of the niters condition.  But
> > yes, doing the pattern matching in the innermost loop processing code
> > looks good to me - for the specific case it would be
> > 
> >   /* Don't distribute loop if niters is unknown.  */
> >   tree niters = number_of_latch_executions (loop);
> >   if (niters == NULL_TREE || niters == chrec_dont_know)
> > ---> here?
> > continue;
> 
> Right, please find attached a new version of the patch where everything
> is included in the loop distribution pass.  I will do a bootstrap and
> regtest on IBM Z over night.  If you give me green light I will also do
> the same on x86_64.

Meanwhile I gave it a shot on x86_64 where the testsuite runs fine (at
least the ldist-strlen testcase).  If you are Ok with the patch, then I
would rebase and run the testsuites again and post a patch series
including the rawmemchr implementation for IBM Z.

Cheers,
Stefan


Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-20 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Thu, May 20, 2021 at 11:24:57AM +0200, Richard Biener wrote:
> On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> > >  wrote:
> > > >
> > > > [snip]
> > > >
> > > > Please find attached a new version of the patch.  A major change 
> > > > compared to
> > > > the previous patch is that I created a separate pass which hopefully 
> > > > makes
> > > > reviewing also easier since it is almost self-contained.  After 
> > > > realizing that
> > > > detecting loops which mimic the behavior of rawmemchr/strlen functions 
> > > > does not
> > > > really fit into the topic of loop distribution, I created a separate 
> > > > pass.
> > >
> > > It's true that these reduction-like patterns are more difficult than
> > > the existing
> > > memcpy/memset cases.
> > >
> > > >  Due
> > > > to this I was also able to play around a bit and schedule the pass at 
> > > > different
> > > > times.  Currently it is scheduled right before loop distribution where 
> > > > loop
> > > > header copying already took place which leads to the following effect.
> > >
> > > In fact I'd schedule it after loop distribution so there's the chance 
> > > that loop
> > > distribution can expose a loop that fits the new pattern.
> > >
> > > >  Running
> > > > this setup over
> > > >
> > > > char *t (char *p)
> > > > {
> > > >   for (; *p; ++p);
> > > >   return p;
> > > > }
> > > >
> > > > the new pass transforms
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char _1;
> > > >   char _7;
> > > >
> > > >[local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > > goto ; [89.00%]
> > > >   else
> > > > goto ; [11.00%]
> > > >
> > > >[local count: 105119324]:
> > > >
> > > >[local count: 955630225]:
> > > >   # p_8 = PHI 
> > > >   p_6 = p_8 + 1;
> > > >   _1 = *p_6;
> > > >   if (_1 != 0)
> > > > goto ; [89.00%]
> > > >   else
> > > > goto ; [11.00%]
> > > >
> > > >[local count: 105119324]:
> > > >   # p_2 = PHI 
> > > >   goto ; [100.00%]
> > > >
> > > >[local count: 850510901]:
> > > >   goto ; [100.00%]
> > > >
> > > >[local count: 12992276]:
> > > >
> > > >[local count: 118111600]:
> > > >   # p_9 = PHI 
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > into
> > > >
> > > > char * t (char * p)
> > > > {
> > > >   char * _5;
> > > >   char _7;
> > > >
> > > >[local count: 118111600]:
> > > >   _7 = *p_3(D);
> > > >   if (_7 != 0)
> > > > goto ; [89.00%]
> > > >   else
> > > > goto ; [11.00%]
> > > >
> > > >[local count: 105119324]:
> > > >   _5 = p_3(D) + 1;
> > > >   p_10 = .RAWMEMCHR (_5, 0);
> > > >
> > > >[local count: 118111600]:
> > > >   # p_9 = PHI 
> > > >   return p_9;
> > > >
> > > > }
> > > >
> > > > which is fine so far.  However, I haven't made up my mind so far 
> > > > whether it is
> > > > worthwhile to spend more time in order to also eliminate the "first 
> > > > unrolling"
> > > > of the loop.
> > >
> > > Might be a phiopt transform ;)  Might apply to quite some set of
> > > builtins.  I wonder what the strlen case looks like, though.
> > >
> > > > I gave it a shot by scheduling the pass prior to the copy-header pass
> > > > and ended up with:
> > > >
> > > > char * t (char * p)
> > > > {
> > > >[local count: 118111600]:
> > > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > > >   return p_5;
> > > >
> > > > }
> > > >
> > > > which seems optimal to me.  The downside of this is that I have to 
> > > > initialize
> > > > scalar evolution analysis which might be undesired that early.
> > > >
> > > > All this brings me to the question where do you see this piece of code 
> > > > running?
> > > > If in a separate pass when would you schedule it?  If in an existing 
> > > > pass,
> > > > which one would you choose?
> > >
> > > I think it still fits loop distribution.  If you manage to detect it
> > > with your pass
> > > standalone then you should be able to detect it in loop distribution.
> >
> > If a loop is distributed only because one of the partitions matches a
> > rawmemchr/strlen-like loop pattern, then we have at least two partitions
> > which walk over the same memory region.  Since a rawmemchr/strlen-like
> > loop has no body (neglecting expression-3 of a for-loop where just an
> > increment happens) it is governed by the memory accesses in the loop
> > condition.  Therefore, in such a case loop distribution would result in
> > performance degradation.  This is why I think that it does not fit
> > conceptually into ldist pass.  However, since I make use of a couple of
> > helper functions from ldist pass, it may still fit technically.
> >
> > Since currently all ldist optimizations operate over loops where niters
> > is known and for rawmemchr/strlen-like loops this is not the case, it is
> > not possible that those optimizations expose a loop which is suitable
> > for rawmemchr/strlen

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-20 Thread Richard Biener via Gcc-patches
On Fri, May 7, 2021 at 2:32 PM Stefan Schulze Frielinghaus
 wrote:
>
> On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> > On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
> >  wrote:
> > >
> > > [snip]
> > >
> > > Please find attached a new version of the patch.  A major change compared 
> > > to
> > > the previous patch is that I created a separate pass which hopefully makes
> > > reviewing also easier since it is almost self-contained.  After realizing 
> > > that
> > > detecting loops which mimic the behavior of rawmemchr/strlen functions 
> > > does not
> > > really fit into the topic of loop distribution, I created a separate pass.
> >
> > It's true that these reduction-like patterns are more difficult than
> > the existing
> > memcpy/memset cases.
> >
> > >  Due
> > > to this I was also able to play around a bit and schedule the pass at 
> > > different
> > > times.  Currently it is scheduled right before loop distribution where 
> > > loop
> > > header copying already took place which leads to the following effect.
> >
> > In fact I'd schedule it after loop distribution so there's the chance that 
> > loop
> > distribution can expose a loop that fits the new pattern.
> >
> > >  Running
> > > this setup over
> > >
> > > char *t (char *p)
> > > {
> > >   for (; *p; ++p);
> > >   return p;
> > > }
> > >
> > > the new pass transforms
> > >
> > > char * t (char * p)
> > > {
> > >   char _1;
> > >   char _7;
> > >
> > >[local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > > goto ; [89.00%]
> > >   else
> > > goto ; [11.00%]
> > >
> > >[local count: 105119324]:
> > >
> > >[local count: 955630225]:
> > >   # p_8 = PHI 
> > >   p_6 = p_8 + 1;
> > >   _1 = *p_6;
> > >   if (_1 != 0)
> > > goto ; [89.00%]
> > >   else
> > > goto ; [11.00%]
> > >
> > >[local count: 105119324]:
> > >   # p_2 = PHI 
> > >   goto ; [100.00%]
> > >
> > >[local count: 850510901]:
> > >   goto ; [100.00%]
> > >
> > >[local count: 12992276]:
> > >
> > >[local count: 118111600]:
> > >   # p_9 = PHI 
> > >   return p_9;
> > >
> > > }
> > >
> > > into
> > >
> > > char * t (char * p)
> > > {
> > >   char * _5;
> > >   char _7;
> > >
> > >[local count: 118111600]:
> > >   _7 = *p_3(D);
> > >   if (_7 != 0)
> > > goto ; [89.00%]
> > >   else
> > > goto ; [11.00%]
> > >
> > >[local count: 105119324]:
> > >   _5 = p_3(D) + 1;
> > >   p_10 = .RAWMEMCHR (_5, 0);
> > >
> > >[local count: 118111600]:
> > >   # p_9 = PHI 
> > >   return p_9;
> > >
> > > }
> > >
> > > which is fine so far.  However, I haven't made up my mind so far whether 
> > > it is
> > > worthwhile to spend more time in order to also eliminate the "first 
> > > unrolling"
> > > of the loop.
> >
> > Might be a phiopt transform ;)  Might apply to quite some set of
> > builtins.  I wonder what the strlen case looks like, though.
> >
> > > I gave it a shot by scheduling the pass prior to the copy-header pass
> > > and ended up with:
> > >
> > > char * t (char * p)
> > > {
> > >[local count: 118111600]:
> > >   p_5 = .RAWMEMCHR (p_3(D), 0);
> > >   return p_5;
> > >
> > > }
> > >
> > > which seems optimal to me.  The downside of this is that I have to 
> > > initialize
> > > scalar evolution analysis which might be undesired that early.
> > >
> > > All this brings me to the question where do you see this piece of code 
> > > running?
> > > If in a separate pass when would you schedule it?  If in an existing pass,
> > > which one would you choose?
> >
> > I think it still fits loop distribution.  If you manage to detect it
> > with your pass
> > standalone then you should be able to detect it in loop distribution.
>
> If a loop is distributed only because one of the partitions matches a
> rawmemchr/strlen-like loop pattern, then we have at least two partitions
> which walk over the same memory region.  Since a rawmemchr/strlen-like
> loop has no body (neglecting expression-3 of a for-loop where just an
> increment happens) it is governed by the memory accesses in the loop
> condition.  Therefore, in such a case loop distribution would result in
> performance degradation.  This is why I think that it does not fit
> conceptually into ldist pass.  However, since I make use of a couple of
> helper functions from ldist pass, it may still fit technically.
>
> Since currently all ldist optimizations operate over loops where niters
> is known and for rawmemchr/strlen-like loops this is not the case, it is
> not possible that those optimizations expose a loop which is suitable
> for rawmemchr/strlen optimization.

True - though that seems to be an unnecessary restriction.

>  Therefore, what do you think about
> scheduling rawmemchr/strlen optimization right between those
> if-statements of function loop_distribution::execute?
>
>    if (nb_generated_loops + nb_generated_calls > 0)
>      {
>        changed = true;
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_OPTIMIZED_

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-07 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Wed, May 05, 2021 at 11:36:41AM +0200, Richard Biener wrote:
> On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > [snip]
> >
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing 
> > that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> > not
> > really fit into the topic of loop distribution, I created a separate pass.
> 
> It's true that these reduction-like patterns are more difficult than
> the existing
> memcpy/memset cases.
> 
> >  Due
> > to this I was also able to play around a bit and schedule the pass at 
> > different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.
> 
> In fact I'd schedule it after loop distribution so there's the chance that 
> loop
> distribution can expose a loop that fits the new pattern.
> 
> >  Running
> > this setup over
> >
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > the new pass transforms
> >
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> >
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >
> >[local count: 955630225]:
> >   # p_8 = PHI 
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >   # p_2 = PHI 
> >   goto ; [100.00%]
> >
> >[local count: 850510901]:
> >   goto ; [100.00%]
> >
> >[local count: 12992276]:
> >
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> >
> > }
> >
> > into
> >
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> >
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> >
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> >
> > }
> >
> > which is fine so far.  However, I haven't made up my mind so far whether it 
> > is
> > worthwhile to spend more time in order to also eliminate the "first 
> > unrolling"
> > of the loop.
> 
> Might be a phiopt transform ;)  Might apply to quite some set of
> builtins.  I wonder what the strlen case looks like, though.
> 
> > I gave it a shot by scheduling the pass prior to the copy-header pass
> > and ended up with:
> >
> > char * t (char * p)
> > {
> >[local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> >
> > }
> >
> > which seems optimal to me.  The downside of this is that I have to 
> > initialize
> > scalar evolution analysis which might be undesired that early.
> >
> > All this brings me to the question where do you see this piece of code 
> > running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
> 
> I think it still fits loop distribution.  If you manage to detect it
> with your pass
> standalone then you should be able to detect it in loop distribution.

If a loop is distributed only because one of the partitions matches a
rawmemchr/strlen-like loop pattern, then we have at least two partitions
which walk over the same memory region.  Since a rawmemchr/strlen-like
loop has no body (neglecting expression-3 of a for-loop where just an
increment happens) it is governed by the memory accesses in the loop
condition.  Therefore, in such a case loop distribution would result in
performance degradation.  This is why I think that it does not fit
conceptually into ldist pass.  However, since I make use of a couple of
helper functions from ldist pass, it may still fit technically.

Since currently all ldist optimizations operate over loops where niters
is known and for rawmemchr/strlen-like loops this is not the case, it is
not possible that those optimizations expose a loop which is suitable
for rawmemchr/strlen optimization.  Therefore, what do you think about
scheduling rawmemchr/strlen optimization right between those
if-statements of function loop_distribution::execute?

   if (nb_generated_loops + nb_generated_calls > 0)
     {
       changed = true;
       if (dump_enabled_p ())
	 dump_printf_loc (MSG_OPTIMIZED_LOCATIONS,
			  loc, "Loop%s %d distributed: split to %d loops "
			  "and %d library calls.\n", str, loop->num,
			  nb_generated_loops, nb_generated_calls);

       break;
     }

   // rawmemchr/strlen like loops

   if (dump_file && (dump_flags & TDF_DETAILS))
     fprintf (dump_file, "Loop%s %d not distributed.\n", str, loop->num);

> Can you
> explain what part is "easier" as 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-05 Thread Richard Biener via Gcc-patches
On Wed, May 5, 2021 at 11:36 AM Richard Biener
 wrote:
>
> On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
>  wrote:
> >
> > [snip]
> >
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing 
> > that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> > not
> > really fit into the topic of loop distribution, I created a separate pass.
>
> It's true that these reduction-like patterns are more difficult than
> the existing
> memcpy/memset cases.
>
> >  Due
> > to this I was also able to play around a bit and schedule the pass at 
> > different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.
>
> In fact I'd schedule it after loop distribution so there's the chance that 
> loop
> distribution can expose a loop that fits the new pattern.
>
> >  Running
> > this setup over
> >
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > the new pass transforms
> >
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> >
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >
> >[local count: 955630225]:
> >   # p_8 = PHI 
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >   # p_2 = PHI 
> >   goto ; [100.00%]
> >
> >[local count: 850510901]:
> >   goto ; [100.00%]
> >
> >[local count: 12992276]:
> >
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> >
> > }
> >
> > into
> >
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> >
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> >
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> >
> > }
> >
> > which is fine so far.  However, I haven't made up my mind so far whether it 
> > is
> > worthwhile to spend more time in order to also eliminate the "first 
> > unrolling"
> > of the loop.
>
> Might be a phiopt transform ;)  Might apply to quite some set of
> builtins.  I wonder what the strlen case looks like, though.
>
> > I gave it a shot by scheduling the pass prior to the copy-header pass
> > and ended up with:
> >
> > char * t (char * p)
> > {
> >[local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> >
> > }
> >
> > which seems optimal to me.  The downside of this is that I have to 
> > initialize
> > scalar evolution analysis which might be undesired that early.
> >
> > All this brings me to the question where do you see this piece of code 
> > running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
>
> I think it still fits loop distribution.  If you manage to detect it
> with your pass
> standalone then you should be able to detect it in loop distribution.  Can you
> explain what part is "easier" as standalone pass?

Btw, another "fitting" pass would be final value replacement (pass_scev_cprop)
since what these patterns provide is a builtin call to compute the value of one
of the loop PHIs on exit.  Note this pass leaves removal of in-loop computations
to followup DCE which means that in some cases it does unprofitable transforms.
There's a bug somewhere where I worked on doing final value replacement
on-demand when DCE figures out the loop is otherwise dead but I never finished
this (loop distribution could also use such a mechanism to get rid of
unwanted PHIs).
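
For readers unfamiliar with final value replacement, a minimal illustration,
not from the thread, of the transform pass_scev_cprop performs:

/* Final value replacement in a nutshell: the exit value of the IV PHI has
   a closed form, so uses after the loop are rewritten and the now-empty
   loop is left for a later DCE pass, which is the unprofitable-transform
   risk mentioned above.  */
int last_index (int n)
{
  int i;
  for (i = 0; i < n; ++i)
    ;
  return i;   /* becomes the equivalent of: return n > 0 ? n : 0; */
}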

> > Another topic which came up is whether there exists a more elegant solution 
> > to
> > my current implementation in order to deal with stores (I'm speaking of the 
> > `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> >
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> >
> > ends up as
> >
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> >
> >[local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >
> >[local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> >
> >[local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto ; [100.00%]
> >
> >[local count: 8505109

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-05 Thread Richard Biener via Gcc-patches
On Tue, Mar 16, 2021 at 6:13 PM Stefan Schulze Frielinghaus
 wrote:
>
> [snip]
>
> Please find attached a new version of the patch.  A major change compared to
> the previous patch is that I created a separate pass which hopefully makes
> reviewing also easier since it is almost self-contained.  After realizing that
> detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> not
> really fit into the topic of loop distribution, I created a separate pass.

It's true that these reduction-like patterns are more difficult than
the existing
memcpy/memset cases.

>  Due
> to this I was also able to play around a bit and schedule the pass at 
> different
> times.  Currently it is scheduled right before loop distribution where loop
> header copying already took place which leads to the following effect.

In fact I'd schedule it after loop distribution so there's the chance that loop
distribution can expose a loop that fits the new pattern.

>  Running
> this setup over
>
> char *t (char *p)
> {
>   for (; *p; ++p);
>   return p;
> }
>
> the new pass transforms
>
> char * t (char * p)
> {
>   char _1;
>   char _7;
>
>[local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 105119324]:
>
>[local count: 955630225]:
>   # p_8 = PHI 
>   p_6 = p_8 + 1;
>   _1 = *p_6;
>   if (_1 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 105119324]:
>   # p_2 = PHI 
>   goto ; [100.00%]
>
>[local count: 850510901]:
>   goto ; [100.00%]
>
>[local count: 12992276]:
>
>[local count: 118111600]:
>   # p_9 = PHI 
>   return p_9;
>
> }
>
> into
>
> char * t (char * p)
> {
>   char * _5;
>   char _7;
>
>[local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 105119324]:
>   _5 = p_3(D) + 1;
>   p_10 = .RAWMEMCHR (_5, 0);
>
>[local count: 118111600]:
>   # p_9 = PHI 
>   return p_9;
>
> }
>
> which is fine so far.  However, I haven't made up my mind so far whether it is
> worthwhile to spend more time in order to also eliminate the "first unrolling"
> of the loop.

Might be a phiopt transform ;)  Might apply to quite some set of
builtins.  I wonder what the strlen case looks like, though.
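
For reference, once rawmemchr is available the strlen result is just a
pointer difference; a sketch of the shape a strlen recognizer would have to
emit, assuming the overflow concerns discussed elsewhere in the thread are
taken care of:

#define _GNU_SOURCE
#include <string.h>   /* rawmemchr is a GNU extension */
#include <stddef.h>

/* Find the first NUL, then subtract the start pointer.  */
size_t strlen_via_rawmemchr (const char *s)
{
  const char *end = rawmemchr (s, 0);
  return (size_t) (end - s);
}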

> I gave it a shot by scheduling the pass prior to the copy-header pass
> and ended up with:
>
> char * t (char * p)
> {
>[local count: 118111600]:
>   p_5 = .RAWMEMCHR (p_3(D), 0);
>   return p_5;
>
> }
>
> which seems optimal to me.  The downside of this is that I have to initialize
> scalar evolution analysis which might be undesired that early.
>
> All this brings me to the question where do you see this piece of code 
> running?
> If in a separate pass when would you schedule it?  If in an existing pass,
> which one would you choose?

I think it still fits loop distribution.  If you manage to detect it
with your pass
standalone then you should be able to detect it in loop distribution.  Can you
explain what part is "easier" as standalone pass?

> Another topic which came up is whether there exists a more elegant solution to
> my current implementation in order to deal with stores (I'm speaking of the 
> `if
> (store_dr)` statement inside of function transform_loop_1).  For example,
>
> extern char *p;
> char *t ()
> {
>   for (; *p; ++p);
>   return p;
> }
>
> ends up as
>
> char * t ()
> {
>   char * _1;
>   char * _2;
>   char _3;
>   char * p.1_8;
>   char _9;
>   char * p.1_10;
>   char * p.1_11;
>
>[local count: 118111600]:
>   p.1_8 = p;
>   _9 = *p.1_8;
>   if (_9 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 105119324]:
>
>[local count: 955630225]:
>   # p.1_10 = PHI <_1(6), p.1_8(5)>
>   _1 = p.1_10 + 1;
>   p = _1;
>   _3 = *_1;
>   if (_3 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 105119324]:
>   # _2 = PHI <_1(3)>
>   goto ; [100.00%]
>
>[local count: 850510901]:
>   goto ; [100.00%]
>
>[local count: 12992276]:
>
>[local count: 118111600]:
>   # p.1_11 = PHI <_2(8), p.1_8(7)>
>   return p.1_11;
>
> }
>
> where inside the loop a load and store occurs.  For a rawmemchr like loop I
> have to show that we never load from a memory location to which we write.
> Currently I solve this by hard coding those facts which are not generic at 
> all.
> I gave compute_data_dependences_for_loop a try which failed to determine the
> fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> there are more generic solutions to express this in contrast to my current 
> one?

So the example loop is not valid to be transformed to rawmemchr since it's
valid to call it with p = &p; - but sure, once you pass the first *p != 0 check
things become fishy but GCC isn't able to turn that into a non-dependence.
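
To make that concrete, a hypothetical caller exercising the aliasing:

/* Hypothetical: the caller may point p at its own storage, so the store
   p = p + 1 inside t's loop rewrites the bytes the loop is scanning,
   which is the dependence that rules out a blind rawmemchr transform.  */
extern char *p;
extern char *t (void);   /* the function from the example above */

char *call_with_self_alias (void)
{
  p = (char *) &p;
  return t ();
}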

Why is the case of stores inside the loop important?  In fact that you
think it is
makes a case for integrating this with 

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-05-04 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Thu, Apr 08, 2021 at 10:23:31AM +0200, Stefan Schulze Frielinghaus wrote:
> ping
> 
> On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote:
> > [snip]
> > 
> > Please find attached a new version of the patch.  A major change compared to
> > the previous patch is that I created a separate pass which hopefully makes
> > reviewing also easier since it is almost self-contained.  After realizing 
> > that
> > detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> > not
> > really fit into the topic of loop distribution, I created a separate pass.  
> > Due
> > to this I was also able to play around a bit and schedule the pass at 
> > different
> > times.  Currently it is scheduled right before loop distribution where loop
> > header copying already took place which leads to the following effect.  
> > Running
> > this setup over
> > 
> > char *t (char *p)
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> > 
> > the new pass transforms
> > 
> > char * t (char * p)
> > {
> >   char _1;
> >   char _7;
> > 
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> > 
> >[local count: 955630225]:
> >   # p_8 = PHI 
> >   p_6 = p_8 + 1;
> >   _1 = *p_6;
> >   if (_1 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> >   # p_2 = PHI 
> >   goto ; [100.00%]
> > 
> >[local count: 850510901]:
> >   goto ; [100.00%]
> > 
> >[local count: 12992276]:
> > 
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> > 
> > }
> > 
> > into
> > 
> > char * t (char * p)
> > {
> >   char * _5;
> >   char _7;
> > 
> >[local count: 118111600]:
> >   _7 = *p_3(D);
> >   if (_7 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> >   _5 = p_3(D) + 1;
> >   p_10 = .RAWMEMCHR (_5, 0);
> > 
> >[local count: 118111600]:
> >   # p_9 = PHI 
> >   return p_9;
> > 
> > }
> > 
> > which is fine so far.  However, I haven't made up my mind so far whether it 
> > is
> > worthwhile to spend more time in order to also eliminate the "first 
> > unrolling"
> > of the loop.  I gave it a shot by scheduling the pass prior to the copy-header pass
> > and ended up with:
> > 
> > char * t (char * p)
> > {
> >[local count: 118111600]:
> >   p_5 = .RAWMEMCHR (p_3(D), 0);
> >   return p_5;
> > 
> > }
> > 
> > which seems optimal to me.  The downside of this is that I have to 
> > initialize
> > scalar evolution analysis which might be undesired that early.
> > 
> > All this brings me to the question where do you see this piece of code 
> > running?
> > If in a separate pass when would you schedule it?  If in an existing pass,
> > which one would you choose?
> > 
> > Another topic which came up is whether there exists a more elegant solution 
> > to
> > my current implementation in order to deal with stores (I'm speaking of the 
> > `if
> > (store_dr)` statement inside of function transform_loop_1).  For example,
> > 
> > extern char *p;
> > char *t ()
> > {
> >   for (; *p; ++p);
> >   return p;
> > }
> > 
> > ends up as
> > 
> > char * t ()
> > {
> >   char * _1;
> >   char * _2;
> >   char _3;
> >   char * p.1_8;
> >   char _9;
> >   char * p.1_10;
> >   char * p.1_11;
> > 
> >[local count: 118111600]:
> >   p.1_8 = p;
> >   _9 = *p.1_8;
> >   if (_9 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> > 
> >[local count: 955630225]:
> >   # p.1_10 = PHI <_1(6), p.1_8(5)>
> >   _1 = p.1_10 + 1;
> >   p = _1;
> >   _3 = *_1;
> >   if (_3 != 0)
> > goto ; [89.00%]
> >   else
> > goto ; [11.00%]
> > 
> >[local count: 105119324]:
> >   # _2 = PHI <_1(3)>
> >   goto ; [100.00%]
> > 
> >[local count: 850510901]:
> >   goto ; [100.00%]
> > 
> >[local count: 12992276]:
> > 
> >[local count: 118111600]:
> >   # p.1_11 = PHI <_2(8), p.1_8(7)>
> >   return p.1_11;
> > 
> > }
> > 
> > where inside the loop a load and store occurs.  For a rawmemchr like loop I
> > have to show that we never load from a memory location to which we write.
> > Currently I solve this by hard coding those facts which are not generic at 
> > all.
> > I gave compute_data_dependences_for_loop a try which failed to determine the
> > fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> > there are more generic solutions to express this in contrast to my current 
> > one?
> > 
> > Thanks again for your input so far.  Really appreciated.
> > 
> > Cheers,
> > Stefan
> 
> > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > index 8a5fb3fd99c..7b2d7405277 100644
> > --- a/gcc/Makefile.in
> > +++ b/gcc/Makefile.in
> > @@ -1608,6 +1608,7 @@ OBJS = \
> > tree-into-ssa.o \
> > tree-iterator.o \
> > tree-loop-distribution.o \
> > +   tree-loop-pattern.o \
> > tree-nested.o \
> >

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-04-08 Thread Stefan Schulze Frielinghaus via Gcc-patches
ping

On Tue, Mar 16, 2021 at 06:13:21PM +0100, Stefan Schulze Frielinghaus wrote:
> [snip]
> 
> Please find attached a new version of the patch.  A major change compared to
> the previous patch is that I created a separate pass which hopefully makes
> reviewing also easier since it is almost self-contained.  After realizing that
> detecting loops which mimic the behavior of rawmemchr/strlen functions does 
> not
> really fit into the topic of loop distribution, I created a separate pass.  
> Due
> to this I was also able to play around a bit and schedule the pass at 
> different
> times.  Currently it is scheduled right before loop distribution where loop
> header copying already took place which leads to the following effect.  
> Running
> this setup over
> 
> char *t (char *p)
> {
>   for (; *p; ++p);
>   return p;
> }
> 
> the new pass transforms
> 
> char * t (char * p)
> {
>   char _1;
>   char _7;
> 
>[local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
> 
>[local count: 955630225]:
>   # p_8 = PHI 
>   p_6 = p_8 + 1;
>   _1 = *p_6;
>   if (_1 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
>   # p_2 = PHI 
>   goto ; [100.00%]
> 
>[local count: 850510901]:
>   goto ; [100.00%]
> 
>[local count: 12992276]:
> 
>[local count: 118111600]:
>   # p_9 = PHI 
>   return p_9;
> 
> }
> 
> into
> 
> char * t (char * p)
> {
>   char * _5;
>   char _7;
> 
>[local count: 118111600]:
>   _7 = *p_3(D);
>   if (_7 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
>   _5 = p_3(D) + 1;
>   p_10 = .RAWMEMCHR (_5, 0);
> 
>[local count: 118111600]:
>   # p_9 = PHI 
>   return p_9;
> 
> }
> 
> which is fine so far.  However, I haven't made up my mind so far whether it is
> worthwhile to spend more time in order to also eliminate the "first unrolling"
> of the loop.  I gave it a shot by scheduling the pass prior to the copy-header pass
> and ended up with:
> 
> char * t (char * p)
> {
>[local count: 118111600]:
>   p_5 = .RAWMEMCHR (p_3(D), 0);
>   return p_5;
> 
> }
> 
> which seems optimal to me.  The downside of this is that I have to initialize
> scalar evolution analysis which might be undesired that early.
> 
> All this brings me to the question where do you see this piece of code 
> running?
> If in a separate pass when would you schedule it?  If in an existing pass,
> which one would you choose?
> 
> Another topic which came up is whether there exists a more elegant solution to
> my current implementation in order to deal with stores (I'm speaking of the 
> `if
> (store_dr)` statement inside of function transform_loop_1).  For example,
> 
> extern char *p;
> char *t ()
> {
>   for (; *p; ++p);
>   return p;
> }
> 
> ends up as
> 
> char * t ()
> {
>   char * _1;
>   char * _2;
>   char _3;
>   char * p.1_8;
>   char _9;
>   char * p.1_10;
>   char * p.1_11;
> 
>[local count: 118111600]:
>   p.1_8 = p;
>   _9 = *p.1_8;
>   if (_9 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
> 
>[local count: 955630225]:
>   # p.1_10 = PHI <_1(6), p.1_8(5)>
>   _1 = p.1_10 + 1;
>   p = _1;
>   _3 = *_1;
>   if (_3 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
> 
>[local count: 105119324]:
>   # _2 = PHI <_1(3)>
>   goto ; [100.00%]
> 
>[local count: 850510901]:
>   goto ; [100.00%]
> 
>[local count: 12992276]:
> 
>[local count: 118111600]:
>   # p.1_11 = PHI <_2(8), p.1_8(7)>
>   return p.1_11;
> 
> }
> 
> where inside the loop a load and store occurs.  For a rawmemchr like loop I
> have to show that we never load from a memory location to which we write.
> Currently I solve this by hard coding those facts which are not generic at 
> all.
> I gave compute_data_dependences_for_loop a try which failed to determine the
> fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
> there are more generic solutions to express this in contrast to my current 
> one?
> 
> Thanks again for your input so far.  Really appreciated.
> 
> Cheers,
> Stefan

> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 8a5fb3fd99c..7b2d7405277 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1608,6 +1608,7 @@ OBJS = \
>   tree-into-ssa.o \
>   tree-iterator.o \
>   tree-loop-distribution.o \
> + tree-loop-pattern.o \
>   tree-nested.o \
>   tree-nrv.o \
>   tree-object-size.o \
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index dd7173126fb..957e96a46a4 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
>gcc_unreachable ();
>  }
>  
> +void
> +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> +{
> +  expand_operand ops[3];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> +  if (!lhs)
> > +    return;
> +  tree lhs_type = TREE_TYPE (l

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-03-16 Thread Stefan Schulze Frielinghaus via Gcc-patches
[snip]

Please find attached a new version of the patch.  A major change compared to
the previous patch is that I created a separate pass which hopefully makes
reviewing also easier since it is almost self-contained.  After realizing that
detecting loops which mimic the behavior of rawmemchr/strlen functions does not
really fit into the topic of loop distribution, I created a separate pass.  Due
to this I was also able to play around a bit and schedule the pass at different
times.  Currently it is scheduled right before loop distribution where loop
header copying already took place which leads to the following effect.  Running
this setup over

char *t (char *p)
{
  for (; *p; ++p);
  return p;
}

the new pass transforms

char * t (char * p)
{
  char _1;
  char _7;

   [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:

   [local count: 955630225]:
  # p_8 = PHI 
  p_6 = p_8 + 1;
  _1 = *p_6;
  if (_1 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  # p_2 = PHI 
  goto ; [100.00%]

   [local count: 850510901]:
  goto ; [100.00%]

   [local count: 12992276]:

   [local count: 118111600]:
  # p_9 = PHI 
  return p_9;

}

into

char * t (char * p)
{
  char * _5;
  char _7;

   [local count: 118111600]:
  _7 = *p_3(D);
  if (_7 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  _5 = p_3(D) + 1;
  p_10 = .RAWMEMCHR (_5, 0);

   [local count: 118111600]:
  # p_9 = PHI 
  return p_9;

}

which is fine so far.  However, I haven't made up my mind so far whether it is
worthwhile to spend more time in order to also eliminate the "first unrolling"
of the loop.  I gave it a shot by scheduling the pass prior to the copy-header pass
and ended up with:

char * t (char * p)
{
   [local count: 118111600]:
  p_5 = .RAWMEMCHR (p_3(D), 0);
  return p_5;

}

which seems optimal to me.  The downside of this is that I have to initialize
scalar evolution analysis which might be undesired that early.

All this brings me to the question where do you see this piece of code running?
If in a separate pass when would you schedule it?  If in an existing pass,
which one would you choose?

Another topic which came up is whether there exists a more elegant solution to
my current implementation in order to deal with stores (I'm speaking of the `if
(store_dr)` statement inside of function transform_loop_1).  For example,

extern char *p;
char *t ()
{
  for (; *p; ++p);
  return p;
}

ends up as

char * t ()
{
  char * _1;
  char * _2;
  char _3;
  char * p.1_8;
  char _9;
  char * p.1_10;
  char * p.1_11;

   [local count: 118111600]:
  p.1_8 = p;
  _9 = *p.1_8;
  if (_9 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:

   [local count: 955630225]:
  # p.1_10 = PHI <_1(6), p.1_8(5)>
  _1 = p.1_10 + 1;
  p = _1;
  _3 = *_1;
  if (_3 != 0)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  # _2 = PHI <_1(3)>
  goto ; [100.00%]

   [local count: 850510901]:
  goto ; [100.00%]

   [local count: 12992276]:

   [local count: 118111600]:
  # p.1_11 = PHI <_2(8), p.1_8(7)>
  return p.1_11;

}

where inside the loop a load and store occurs.  For a rawmemchr like loop I
have to show that we never load from a memory location to which we write.
Currently I solve this by hard coding those facts which are not generic at all.
I gave compute_data_dependences_for_loop a try which failed to determine the
fact that stores only happen to p[0] and loads from p[i] where i>0.  Maybe
there are more generic solutions to express this in contrast to my current one?

Thanks again for your input so far.  Really appreciated.

Cheers,
Stefan
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 8a5fb3fd99c..7b2d7405277 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1608,6 +1608,7 @@ OBJS = \
 	tree-into-ssa.o \
 	tree-iterator.o \
 	tree-loop-distribution.o \
+	tree-loop-pattern.o \
 	tree-nested.o \
 	tree-nrv.o \
 	tree-object-size.o \
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index dd7173126fb..957e96a46a4 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2917,6 +2917,33 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);
+  if (!lhs)
+    return;
+  tree lhs_type = TREE_TYPE (lhs);
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (lhs_type));
+
+  for (unsigned int i = 0; i < 2; ++i)
+    {
+      tree rhs = gimple_call_arg (stmt, i);
+      tree rhs_type = TREE_TYPE (rhs);
+      rtx rhs_rtx = expand_normal (rhs);
+      create_input_operand (&ops[i + 1], rhs_rtx, TYPE_MODE (rhs_type));
+    }
+
+  insn_code icode = direct_optab_handler (rawmemchr_optab, ops[2].mode);
+
+  expand_insn (i
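
The archive truncates the hunk at this point.  Judging from the operand
setup above, the tail presumably completes the cut-off call by emitting the
optab insn and copying the result back; a sketch of the assumed remainder,
not the verbatim patch:

  expand_insn (icode, 3, ops);
  /* If expansion placed the result somewhere other than the call's lhs,
     move it there.  */
  if (!rtx_equal_p (lhs_rtx, ops[0].value))
    emit_move_insn (lhs_rtx, ops[0].value);
}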

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-03-03 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Mar 02, 2021 at 01:29:59PM +0100, Richard Biener wrote:
> On Sun, Feb 14, 2021 at 11:27 AM Stefan Schulze Frielinghaus
>  wrote:
> >
> > On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > > Gcc-patches  wrote:
> > > >
> > > > This patch adds support for recognizing loops which mimic the behaviour
> > > > of function rawmemchr, and replaces those with an internal function call
> > > > in case a target provides them.  In contrast to the original rawmemchr
> > > > function, this patch also supports different instances where the memory
> > > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > > respectively.
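
For readers new to the idiom, rawmemchr performs an unbounded forward scan
(a match must be guaranteed to exist); a reference model of the semantics
being matched, with a 16-bit instance shown for illustration:

#include <stdint.h>

uint8_t *rawmemchr8_ref (const uint8_t *s, uint8_t c)
{
  while (*s != c)
    ++s;                       /* no length bound, unlike memchr */
  return (uint8_t *) s;
}

uint16_t *rawmemchr16_ref (const uint16_t *s, uint16_t c)
{
  while (*s != c)
    ++s;
  return (uint16_t *) s;
}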
> > > >
> > > > This patch is not final and I'm looking for some feedback:
> > > >
> > > > Previously, only loops which mimic the behaviours of functions memset,
> > > > memcpy, and memmove have been detected and replaced by corresponding
> > > > function calls.  One characteristic of those loops/partitions is that
> > > > they don't have a reduction.  In contrast, loops which mimic the
> > > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > > My current attempt is to ensure that the reduction statement is not used
> > > > in any other partition and only in that case ignore the reduction and
> > > > replace the loop by a function call.  We then only need to replace the
> > > > reduction variable of the loop which contained the loop result by the
> > > > variable of the lhs of the internal function call.  This should ensure
> > > > that the transformation is correct independently of how partitions are
> > > > fused/distributed in the end.  Any thoughts about this?
> > >
> > > Currently we're forcing reduction partitions last (and force to have a 
> > > single
> > > one by fusing all partitions containing a reduction) because 
> > > code-generation
> > > does not properly update SSA form for the reduction results.  ISTR that
> > > might be just because we do not copy the LC PHI nodes or do not adjust
> > > them when copying.  That might not be an issue in case you replace the
> > > partition with a call.  I guess you can try to have a testcase with
> > > two rawmemchr patterns and a regular loop part that has to be scheduled
> > in between both for correctness.
> >
> > Ah ok, in that case I updated my patch by removing the constraint that
> > the reduction statement must be in precisely one partition.  Please find
> attached the testcases I came up with so far.  Since transforming a loop into
> a rawmemchr function call is backend dependent, I planned to include
> > those only in my backend patch.  I wasn't able to come up with any
> > testcase where a loop is distributed into multiple partitions and where
> > one is classified as a rawmemchr builtin.  The latter boils down to a
> > for loop with an empty body only in which case I suspect that loop
> > distribution shouldn't be done anyway.
> >
> > > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > > builtin_info which I consider rather hacky.  For the long run I thought
> > > > about to split up structure builtin_info into a union where each member
> > > > is a structure for a particular builtin of a partition, i.e., something
> > > > like this:
> > > >
> > > > union builtin_info
> > > > {
> > > >   struct binfo_memset *memset;
> > > >   struct binfo_memcpymove *memcpymove;
> > > >   struct binfo_rawmemchr *rawmemchr;
> > > > };
> > > >
> > > > Such that a structure for one builtin does not get "polluted" by a
> > > > different one.  Any thoughts about this?
> > >
> > > Probably makes sense if the list of recognized patterns grow further.
> > >
> > > I see you use internal functions rather than builtin functions.  I guess
> > > that's OK.  But you use new target hooks for expansion where I think
> > > new optab entries similar to cmpmem would be more appropriate
> > > where the distinction between 8, 16 or 32 bits can be encoded in
> > > the modes.
> >
> > The optab implementation is really nice which allows me to use iterators
> in the backend which in the end saves me some boilerplate code compared
> > to the previous implementation :)
> >
> > While using optabs now, I only require one additional member (pattern)
> > in the builtin_info struct.  Thus I didn't want to overcomplicate things
> > and kept the single struct approach as is.
> >
> > For the long run, should I resubmit this patch once stage 1 opens or how
> > would you propose to proceed?
> 
> Yes, and sorry for the delay.  Few comments on the patch given I had a
> quick look:
> 
> +void
> +expand_RAWMEMCHR (internal_fn, gcall *stmt)
> +{
> +  expand_operand ops[3];
> +
> +  tree lhs = gimple_call_lhs (stmt);
> 
> I think that given we have people testing with -fno-tree-dce you
> should try to handle a NULL LHS gracefully.  I suppose by
> simply doing nothing.
> 
> +  tree result_old = build_fold_addr_expr (DR_REF (d

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-03-02 Thread Richard Biener via Gcc-patches
On Sun, Feb 14, 2021 at 11:27 AM Stefan Schulze Frielinghaus
 wrote:
>
> On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > Gcc-patches  wrote:
> > >
> > > This patch adds support for recognizing loops which mimic the behaviour
> > > of function rawmemchr, and replaces those with an internal function call
> > > in case a target provides them.  In contrast to the original rawmemchr
> > > function, this patch also supports different instances where the memory
> > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > respectively.
> > >
> > > This patch is not final and I'm looking for some feedback:
> > >
> > > Previously, only loops which mimic the behaviours of functions memset,
> > > memcpy, and memmove have been detected and replaced by corresponding
> > > function calls.  One characteristic of those loops/partitions is that
> > > they don't have a reduction.  In contrast, loops which mimic the
> > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > My current attempt is to ensure that the reduction statement is not used
> > > in any other partition and only in that case ignore the reduction and
> > > replace the loop by a function call.  We then only need to replace the
> > > reduction variable of the loop which contained the loop result by the
> > > variable of the lhs of the internal function call.  This should ensure
> > > that the transformation is correct independently of how partitions are
> > > fused/distributed in the end.  Any thoughts about this?
> >
> > Currently we're forcing reduction partitions last (and force to have a 
> > single
> > one by fusing all partitions containing a reduction) because code-generation
> > does not properly update SSA form for the reduction results.  ISTR that
> > might be just because we do not copy the LC PHI nodes or do not adjust
> > them when copying.  That might not be an issue in case you replace the
> > partition with a call.  I guess you can try to have a testcase with
> > two rawmemchr patterns and a regular loop part that has to be scheduled
> > in between both for correctness.
>
> Ah ok, in that case I updated my patch by removing the constraint that
> the reduction statement must be in precisely one partition.  Please find
> attached the testcases I came up with so far.  Since transforming a loop into
> a rawmemchr function call is backend dependent, I planned to include
> those only in my backend patch.  I wasn't able to come up with any
> testcase where a loop is distributed into multiple partitions and where
> one is classified as a rawmemchr builtin.  The latter boils down to a
> for loop with an empty body only in which case I suspect that loop
> distribution shouldn't be done anyway.
>
> > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > builtin_info which I consider rather hacky.  For the long run I thought
> > > about to split up structure builtin_info into a union where each member
> > > is a structure for a particular builtin of a partition, i.e., something
> > > like this:
> > >
> > > union builtin_info
> > > {
> > >   struct binfo_memset *memset;
> > >   struct binfo_memcpymove *memcpymove;
> > >   struct binfo_rawmemchr *rawmemchr;
> > > };
> > >
> > > Such that a structure for one builtin does not get "polluted" by a
> > > different one.  Any thoughts about this?
> >
> > Probably makes sense if the list of recognized patterns grow further.
> >
> > I see you use internal functions rather than builtin functions.  I guess
> > that's OK.  But you use new target hooks for expansion where I think
> > new optab entries similar to cmpmem would be more appropriate
> > where the distinction between 8, 16 or 32 bits can be encoded in
> > the modes.
>
> The optab implementation is really nice which allows me to use iterators
> in the backend which in the end saves me some boilerplate code compared
> to the previous implementation :)
>
> While using optabs now, I only require one additional member (pattern)
> in the builtin_info struct.  Thus I didn't want to overcomplicate things
> and kept the single struct approach as is.
>
> For the long run, should I resubmit this patch once stage 1 opens or how
> would you propose to proceed?

Yes, and sorry for the delay.  Few comments on the patch given I had a
quick look:

+void
+expand_RAWMEMCHR (internal_fn, gcall *stmt)
+{
+  expand_operand ops[3];
+
+  tree lhs = gimple_call_lhs (stmt);

I think that given we have people testing with -fno-tree-dce you
should try to handle a NULL LHS gracefully.  I suppose by
simply doing nothing.

+  tree result_old = build_fold_addr_expr (DR_REF (dr));
+  tree result_new = copy_ssa_name (result_old);

I think you simply want

   tree result = make_ssa_name (ptr_type_node);

most definitely

+  imm_use_iterator iter;
+  gimple *stmt;
+  use_operand_p use_p;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, result_old)
+{
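
The quoted hunk is cut off by the archive here; the loop body presumably
follows the standard immediate-use replacement idiom from
gcc/ssa-iterators.h.  A sketch of the assumed continuation, not the
verbatim patch:

      /* Redirect every use of result_old to result_new.  */
      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
	SET_USE (use_p, result_new);
      update_stmt (stmt);
    }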

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-02-25 Thread Jeff Law via Gcc-patches



On 2/25/21 10:01 AM, Stefan Schulze Frielinghaus via Gcc-patches wrote:
> Ping
Given this is not a regression, it needs to wait for gcc-12.
jeff



Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-02-25 Thread Stefan Schulze Frielinghaus via Gcc-patches
Ping

On Sun, Feb 14, 2021 at 11:27:40AM +0100, Stefan Schulze Frielinghaus wrote:
> On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> > On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> > Gcc-patches  wrote:
> > >
> > > This patch adds support for recognizing loops which mimic the behaviour
> > > of function rawmemchr, and replaces those with an internal function call
> > > in case a target provides them.  In contrast to the original rawmemchr
> > > function, this patch also supports different instances where the memory
> > > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > > respectively.
> > >
> > > This patch is not final and I'm looking for some feedback:
> > >
> > > Previously, only loops which mimic the behaviours of functions memset,
> > > memcpy, and memmove have been detected and replaced by corresponding
> > > function calls.  One characteristic of those loops/partitions is that
> > > they don't have a reduction.  In contrast, loops which mimic the
> > > behaviour of rawmemchr compute a result and therefore have a reduction.
> > > My current attempt is to ensure that the reduction statement is not used
> > > in any other partition and only in that case ignore the reduction and
> > > replace the loop by a function call.  We then only need to replace the
> > > reduction variable of the loop which contained the loop result by the
> > > variable of the lhs of the internal function call.  This should ensure
> > > that the transformation is correct independently of how partitions are
> > > fused/distributed in the end.  Any thoughts about this?
> > 
> > Currently we're forcing reduction partitions last (and force to have a 
> > single
> > one by fusing all partitions containing a reduction) because code-generation
> > does not properly update SSA form for the reduction results.  ISTR that
> > might be just because we do not copy the LC PHI nodes or do not adjust
> > them when copying.  That might not be an issue in case you replace the
> > partition with a call.  I guess you can try to have a testcase with
> > two rawmemchr patterns and a regular loop part that has to be scheduled
> > in between both for correctness.
> 
> Ah ok, in that case I updated my patch by removing the constraint that
> the reduction statement must be in precisely one partition.  Please find
> attached the testcases I came up with so far.  Since transforming a loop into
> a rawmemchr function call is backend dependent, I planned to include
> those only in my backend patch.  I wasn't able to come up with any
> testcase where a loop is distributed into multiple partitions and where
> one is classified as a rawmemchr builtin.  The latter boils down to a
> for loop with an empty body only in which case I suspect that loop
> distribution shouldn't be done anyway.
> 
> > > Furthermore, I simply added two new members (pattern, fn) to structure
> > > builtin_info which I consider rather hacky.  For the long run I thought
> > > about to split up structure builtin_info into a union where each member
> > > is a structure for a particular builtin of a partition, i.e., something
> > > like this:
> > >
> > > union builtin_info
> > > {
> > >   struct binfo_memset *memset;
> > >   struct binfo_memcpymove *memcpymove;
> > >   struct binfo_rawmemchr *rawmemchr;
> > > };
> > >
> > > Such that a structure for one builtin does not get "polluted" by a
> > > different one.  Any thoughts about this?
> > 
> > Probably makes sense if the list of recognized patterns grow further.
> > 
> > I see you use internal functions rather than builtin functions.  I guess
> > that's OK.  But you use new target hooks for expansion where I think
> > new optab entries similar to cmpmem would be more appropriate
> > where the distinction between 8, 16 or 32 bits can be encoded in
> > the modes.
> 
> The optab implementation is really nice which allows me to use iterators
> in the backend which in the end saves me some boilerplate code compared
> to the previous implementation :)
> 
> While using optabs now, I only require one additional member (pattern)
> in the builtin_info struct.  Thus I didn't want to overcomplicate things
> and kept the single struct approach as is.
> 
> For the long run, should I resubmit this patch once stage 1 opens or how
> would you propose to proceed?
> 
> Thanks for your review so far!
> 
> Cheers,
> Stefan
> 
> > 
> > Richard.
> > 
> > > Cheers,
> > > Stefan
> > > ---
> > >  gcc/internal-fn.c|  42 ++
> > >  gcc/internal-fn.def  |   3 +
> > >  gcc/target-insns.def |   3 +
> > >  gcc/tree-loop-distribution.c | 257 ++-
> > >  4 files changed, 272 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > > index dd7173126fb..9cd62544a1a 100644
> > > --- a/gcc/internal-fn.c
> > > +++ b/gcc/internal-fn.c
> > > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> > >gcc_unreac

Re: [RFC] ldist: Recognize rawmemchr loop patterns

2021-02-14 Thread Stefan Schulze Frielinghaus via Gcc-patches
On Tue, Feb 09, 2021 at 09:57:58AM +0100, Richard Biener wrote:
> On Mon, Feb 8, 2021 at 3:11 PM Stefan Schulze Frielinghaus via
> Gcc-patches  wrote:
> >
> > This patch adds support for recognizing loops which mimic the behaviour
> > of function rawmemchr, and replaces those with an internal function call
> > in case a target provides them.  In contrast to the original rawmemchr
> > function, this patch also supports different instances where the memory
> > pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
> > respectively.
> >
> > This patch is not final and I'm looking for some feedback:
> >
> > Previously, only loops which mimic the behaviours of functions memset,
> > memcpy, and memmove have been detected and replaced by corresponding
> > function calls.  One characteristic of those loops/partitions is that
> > they don't have a reduction.  In contrast, loops which mimic the
> > behaviour of rawmemchr compute a result and therefore have a reduction.
> > My current attempt is to ensure that the reduction statement is not used
> > in any other partition and only in that case ignore the reduction and
> > replace the loop by a function call.  We then only need to replace the
> > reduction variable of the loop which contained the loop result by the
> > variable of the lhs of the internal function call.  This should ensure
> > that the transformation is correct independently of how partitions are
> > fused/distributed in the end.  Any thoughts about this?
> 
> Currently we're forcing reduction partitions last (and force to have a single
> one by fusing all partitions containing a reduction) because code-generation
> does not properly update SSA form for the reduction results.  ISTR that
> might be just because we do not copy the LC PHI nodes or do not adjust
> them when copying.  That might not be an issue in case you replace the
> partition with a call.  I guess you can try to have a testcase with
> two rawmemchr patterns and a regular loop part that has to be scheduled
> in between both for correctness.

Ah ok, in that case I updated my patch by removing the constraint that
the reduction statement must be in precisely one partition.  Please find
attached the testcases I came up with so far.  Since transforming a loop into
a rawmemchr function call is backend dependent, I planned to include
those only in my backend patch.  I wasn't able to come up with any
testcase where a loop is distributed into multiple partitions and where
one is classified as a rawmemchr builtin.  The latter boils down to a
for loop with an empty body only in which case I suspect that loop
distribution shouldn't be done anyway.

> > Furthermore, I simply added two new members (pattern, fn) to structure
> > builtin_info which I consider rather hacky.  For the long run I thought
> > about to split up structure builtin_info into a union where each member
> > is a structure for a particular builtin of a partition, i.e., something
> > like this:
> >
> > union builtin_info
> > {
> >   struct binfo_memset *memset;
> >   struct binfo_memcpymove *memcpymove;
> >   struct binfo_rawmemchr *rawmemchr;
> > };
> >
> > Such that a structure for one builtin does not get "polluted" by a
> > different one.  Any thoughts about this?
> 
> Probably makes sense if the list of recognized patterns grow further.
> 
> I see you use internal functions rather than builtin functions.  I guess
> that's OK.  But you use new target hooks for expansion, where I think
> new optab entries similar to cmpmem would be more appropriate; the
> distinction between 8, 16, or 32 bits can then be encoded in
> the modes.

The optab implementation is really nice: it allows me to use iterators
in the backend, which in the end saves me some boilerplate code compared
to the previous implementation :)

Now that I use optabs, I only require one additional member (pattern)
in the builtin_info struct.  Thus I didn't want to overcomplicate things
and kept the single-struct approach as is.
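
For reference, the expander side then collapses into a single function
along these lines (just a sketch, assuming a new direct optab
rawmemchr_optab and a single mode-polymorphic IFN_RAWMEMCHR; the actual
follow-up patch may differ in details):

/* Sketch: expand a rawmemchr call.  The element size is encoded in the
   mode of the pattern argument, so a single optab entry replaces the
   three target hooks.  */
static void
expand_RAWMEMCHR (internal_fn, gcall *stmt)
{
  tree lhs = gimple_call_lhs (stmt);
  if (!lhs)
    return;
  machine_mode mode = TYPE_MODE (TREE_TYPE (gimple_call_arg (stmt, 1)));
  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
  expand_operand ops[3];
  create_output_operand (&ops[0], lhs_rtx, TYPE_MODE (TREE_TYPE (lhs)));
  create_fixed_operand (&ops[1],
                        get_memory_rtx (gimple_call_arg (stmt, 0), NULL));
  create_input_operand (&ops[2],
                        expand_normal (gimple_call_arg (stmt, 1)), mode);
  expand_insn (direct_optab_handler (rawmemchr_optab, mode), 3, ops);
  /* expand_insn may have placed the result elsewhere.  */
  if (!rtx_equal_p (lhs_rtx, ops[0].value))
    emit_move_insn (lhs_rtx, ops[0].value);
}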

For the long run, should I resubmit this patch once stage 1 opens, or
how would you propose to proceed?

Thanks for your review so far!

Cheers,
Stefan

> 
> Richard.
> 
> > Cheers,
> > Stefan
> > ---
> >  gcc/internal-fn.c            |  42 ++
> >  gcc/internal-fn.def          |   3 +
> >  gcc/target-insns.def         |   3 +
> >  gcc/tree-loop-distribution.c | 257 ++-
> >  4 files changed, 272 insertions(+), 33 deletions(-)
> >
> > diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> > index dd7173126fb..9cd62544a1a 100644
> > --- a/gcc/internal-fn.c
> > +++ b/gcc/internal-fn.c
> > @@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
> >    gcc_unreachable ();
> >  }
> >
> > +static void
> > +expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
> > +{
> > +  if (targetm.have_rawmemchr8 ())
> > +    {
> > +      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode,
> > +                                EXPAND_WRITE);
> > +      rtx start = expand_normal (gimple_call_arg (stmt, 0));
> > [...]

[RFC] ldist: Recognize rawmemchr loop patterns

2021-02-08 Thread Stefan Schulze Frielinghaus via Gcc-patches
This patch adds support for recognizing loops which mimic the behaviour
of function rawmemchr, and replaces those with an internal function call
in case a target provides them.  In contrast to the original rawmemchr
function, this patch also supports different instances where the memory
pointed to and the pattern are interpreted as 8, 16, and 32 bit sized,
respectively.
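
For example, the 32-bit variant corresponds to loops of roughly this
shape (a sketch):

unsigned int *
like_rawmemchr32 (unsigned int *p, unsigned int pattern)
{
  /* Same idiom as rawmemchr, but with 32-bit sized accesses.  */
  while (*p != pattern)
    ++p;
  return p;
}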

This patch is not final and I'm looking for some feedback:

Previously, only loops which mimic the behaviours of functions memset,
memcpy, and memmove have been detected and replaced by corresponding
function calls.  One characteristic of those loops/partitions is that
they don't have a reduction.  In contrast, loops which mimic the
behaviour of rawmemchr compute a result and therefore have a reduction.
My current attempt is to ensure that the reduction statement is not used
in any other partition, and only in that case to ignore the reduction and
replace the loop by a function call.  We then only need to replace the
reduction variable which holds the loop result by the lhs of the
internal function call.  This should ensure
that the transformation is correct independently of how partitions are
fused/distributed in the end.  Any thoughts about this?
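
To illustrate, after the transformation the partition is reduced to a
single internal function call on GIMPLE level, e.g. (SSA names
invented):

  result_1 = .RAWMEMCHR8 (start_2, pattern_3);

and every use of the former reduction variable is then rewritten to use
result_1.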

Furthermore, I simply added two new members (pattern, fn) to structure
builtin_info, which I consider rather hacky.  For the long run I thought
about splitting up structure builtin_info into a union where each member
is a structure for a particular builtin of a partition, i.e., something
like this:

union builtin_info
{
  struct binfo_memset *memset;
  struct binfo_memcpymove *memcpymove;
  struct binfo_rawmemchr *rawmemchr;
};

Such that a structure for one builtin does not get "polluted" by a
different one.  Any thoughts about this?
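
Accesses would then be keyed off the partition kind, so that each
generator only ever touches its own union member, e.g. (a sketch of the
code-generation dispatch; PKIND_RAWMEMCHR and generate_rawmemchr_builtin
are hypothetical names this patch would introduce):

  switch (partition->kind)
    {
    case PKIND_MEMSET:
      /* Only the binfo_memset member is valid here.  */
      generate_memset_builtin (loop, partition);
      break;
    case PKIND_RAWMEMCHR:  /* hypothetical */
      generate_rawmemchr_builtin (loop, partition);
      break;
    default:
      /* Other kinds elided in this sketch.  */
      break;
    }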

Cheers,
Stefan
---
 gcc/internal-fn.c            |  42 ++
 gcc/internal-fn.def          |   3 +
 gcc/target-insns.def         |   3 +
 gcc/tree-loop-distribution.c | 257 ++-

diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index dd7173126fb..9cd62544a1a 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2917,6 +2917,48 @@ expand_VEC_CONVERT (internal_fn, gcall *)
   gcc_unreachable ();
 }
 
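+/* Expand the IFN_RAWMEMCHR8 function.  */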
+static void
+expand_RAWMEMCHR8 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr8 ())
+    {
+      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode,
+                                EXPAND_WRITE);
+      rtx start = expand_normal (gimple_call_arg (stmt, 0));
+      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+      emit_insn (targetm.gen_rawmemchr8 (result, start, pattern));
+    }
+  else
+    gcc_unreachable ();
+}
+
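+/* Expand the IFN_RAWMEMCHR16 function.  */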
+static void
+expand_RAWMEMCHR16 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr16 ())
+    {
+      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode,
+                                EXPAND_WRITE);
+      rtx start = expand_normal (gimple_call_arg (stmt, 0));
+      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+      emit_insn (targetm.gen_rawmemchr16 (result, start, pattern));
+    }
+  else
+    gcc_unreachable ();
+}
+
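+/* Expand the IFN_RAWMEMCHR32 function.  */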
+static void
+expand_RAWMEMCHR32 (internal_fn, gcall *stmt)
+{
+  if (targetm.have_rawmemchr32 ())
+    {
+      rtx result = expand_expr (gimple_call_lhs (stmt), NULL_RTX, VOIDmode,
+                                EXPAND_WRITE);
+      rtx start = expand_normal (gimple_call_arg (stmt, 0));
+      rtx pattern = expand_normal (gimple_call_arg (stmt, 1));
+      emit_insn (targetm.gen_rawmemchr32 (result, start, pattern));
+    }
+  else
+    gcc_unreachable ();
+}
+
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index daeace7a34e..34247859704 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -348,6 +348,9 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (VEC_CONVERT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR8, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR16, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (RAWMEMCHR32, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
a CFG property in the face of jump threading, tail merging or
diff --git a/gcc/target-insns.def b/gcc/target-insns.def
index 672c35698d7..9248554cbf3 100644
--- a/gcc/target-insns.def
+++ b/gcc/target-insns.def
@@ -106,3 +106,6 @@ DEF_TARGET_INSN (trap, (void))
 DEF_TARGET_INSN (unique, (void))
 DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
+DEF_TARGET_INSN (rawmemchr8, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (rawmemchr16, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (rawmemchr32, (rtx x0, rtx x1, rtx x2))