Re: [RFC] ldist: Recognize rawmemchr loop patterns

Tom de Vries via Gcc-patches Mon, 31 Jan 2022 05:17:14 -0800

On 9/17/21 10:08, Richard Biener via Gcc-patches wrote:

On Mon, Sep 13, 2021 at 4:53 PM Stefan Schulze Frielinghaus
<stefa...@linux.ibm.com> wrote:


On Mon, Sep 06, 2021 at 11:56:21AM +0200, Richard Biener wrote:

On Fri, Sep 3, 2021 at 10:01 AM Stefan Schulze Frielinghaus
<stefa...@linux.ibm.com> wrote:


On Fri, Aug 20, 2021 at 12:35:58PM +0200, Richard Biener wrote:
[...]


+  /* Handle strlen like loops.  */
+  if (store_dr == NULL
+      && integer_zerop (pattern)
+      && TREE_CODE (reduction_iv.base) == INTEGER_CST
+      && TREE_CODE (reduction_iv.step) == INTEGER_CST
+      && integer_onep (reduction_iv.step)
+      && (types_compatible_p (TREE_TYPE (reduction_var), size_type_node)
+         || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))))
+    {

I wonder what goes wrong with a larger or smaller wrapping IV type?
The iteration
only stops when you load a NUL and the increments just wrap along (you're
using the pointer IVs to compute the strlen result).  Can't you simply truncate?


I think truncation is enough as long as no overflow occurs in strlen or
strlen_using_rawmemchr.

For larger than size_type_node (actually larger than ptr_type_node would matter
I guess), the argument is that since pointer wrapping would be undefined anyway
the IV cannot wrap either.  Now, the correct check here would IMHO be

       TYPE_PRECISION (TREE_TYPE (reduction_var)) < TYPE_PRECISION
(ptr_type_node)
        || TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (pointer-iv-var))

?


Regarding the implementation which makes use of rawmemchr:

We can count at most PTRDIFF_MAX many bytes without an overflow.  Thus,
the maximal length we can determine of a string where each character has
size S is PTRDIFF_MAX / S without an overflow.  Since an overflow for
ptrdiff type is undefined we have to make sure that if an overflow
occurs, then an overflow occurs for reduction variable, too, and that
this is undefined, too.  However, I'm not sure anymore whether we want
to respect overflows in all cases.  If TYPE_PRECISION (ptr_type_node)
equals TYPE_PRECISION (ptrdiff_type_node) and an overflow occurs, then
this would mean that a single string consumes more than half of the
virtual addressable memory.  At least for architectures where
TYPE_PRECISION (ptrdiff_type_node) == 64 holds, I think it is reasonable
to neglect the case where computing pointer difference may overflow.
Otherwise we are talking about strings with lenghts of multiple
pebibytes.  For other architectures we might have to be more precise
and make sure that reduction variable overflows first and that this is
undefined.

Thus a conservative condition would be (I assumed that the size of any
integral type is a power of two which I'm not sure if this really holds;
IIRC the C standard requires only that the alignment is a power of two
but not necessarily the size so I might need to change this):

/* Compute precision (reduction_var) < (precision (ptrdiff_type) - 1 - log2 
(sizeof (load_type))
    or in other words return true if reduction variable overflows first
    and false otherwise.  */

static bool
reduction_var_overflows_first (tree reduction_var, tree load_type)
{
   unsigned precision_ptrdiff = TYPE_PRECISION (ptrdiff_type_node);
   unsigned precision_reduction_var = TYPE_PRECISION (TREE_TYPE 
(reduction_var));
   unsigned size_exponent = wi::exact_log2 (wi::to_wide (TYPE_SIZE_UNIT 
(load_type)));
   return wi::ltu_p (precision_reduction_var, precision_ptrdiff - 1 - 
size_exponent);
}

TYPE_PRECISION (ptrdiff_type_node) == 64
|| (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
     && reduction_var_overflows_first (reduction_var, load_type)

Regarding the implementation which makes use of strlen:

I'm not sure what it means if strlen is called for a string with a
length greater than SIZE_MAX.  Therefore, similar to the implementation
using rawmemchr where we neglect the case of an overflow for 64bit
architectures, a conservative condition would be:

TYPE_PRECISION (size_type_node) == 64
|| (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
     && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (size_type_node))

I still included the overflow undefined check for reduction variable in
order to rule out situations where the reduction variable is unsigned
and overflows as many times until strlen(,_using_rawmemchr) overflows,
too.  Maybe this is all theoretical nonsense but I'm afraid of uncommon
architectures.  Anyhow, while writing this down it becomes clear that
this deserves a comment which I will add once it becomes clear which way
to go.


I think all the arguments about objects bigger than half of the address-space
also are valid for 32bit targets and thus 32bit size_type_node (or
32bit pointer size).
I'm not actually sure what's the canonical type to check against, whether
it's size_type_node (Cs size_t), ptr_type_node (Cs void *) or sizetype (the
middle-end "offset" type used for all address computations).  For weird reasons
I'd lean towards 'sizetype' (for example some embedded targets have 24bit
pointers but 16bit 'sizetype').


Ok, for the strlen implementation I changed from size_type_node to
sizetype and assume that no overflow occurs for string objects bigger
than half of the address space for 32-bit targets and up:

   (TYPE_PRECISION (sizetype) >= TYPE_PRECISION (ptr_type_node) - 1
    && TYPE_PRECISION (ptr_type_node) >= 32)
   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
       && TYPE_PRECISION (reduction_var) <= TYPE_PRECISION (sizetype))

and similarly for the rawmemchr implementation:

   (TYPE_PRECISION (ptrdiff_type_node) == TYPE_PRECISION (ptr_type_node)
    && TYPE_PRECISION (ptrdiff_type_node) >= 32)
   || (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var))
       && reduction_var_overflows_first (reduction_var, load_type))


+      if (TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (reduction_var)))
+       {
+         const char *msg = G_("assuming signed overflow does not occur "
+                              "when optimizing strlen like loop");
+         fold_overflow_warning (msg, WARN_STRICT_OVERFLOW_MISC);
+       }

no, please don't add any new strict-overflow warnings ;)


I just stumbled over code which produces such a warning and thought this
is a hard requirement :D The new patch doesn't contain it anymore.


The generate_*_builtin routines need some factoring - if you code-generate
into a gimple_seq you could use gimple_build () which would do the fold_stmt
(not sure why you do that - you should see to fold the call, not necessarily
the rest).  The replacement of reduction_var and the dumping could be shared.
There's also GET_MODE_NAME for the printing.


I wasn't really sure which way to go.  Use a gsi, as it is done by
existing generate_* functions, or make use of gimple_seq.  Since the
latter uses internally also gsi I thought it is better to stick to gsi
in the first place.  Now, after changing to gimple_seq I see the beauty
of it :)

I created two helper functions generate_strlen_builtin_1 and
generate_reduction_builtin_1 in order to reduce code duplication.

In function generate_strlen_builtin I changed from using
builtin_decl_implicit (BUILT_IN_STRLEN) to builtin_decl_explicit
(BUILT_IN_STRLEN) since the former could return a NULL pointer. I'm not
sure whether my intuition about the difference between implicit and
explicit builtins is correct.  In builtins.def there is a small example
given which I would paraphrase as "use builtin_decl_explicit if the
semantics of the builtin is defined by the C standard; otherwise use
builtin_decl_implicit" but probably my intuition is wrong?

Beside that I'm not sure whether I really have to call
build_fold_addr_expr which looks superfluous to me since
gimple_build_call can deal with ADDR_EXPR as well as FUNCTION_DECL:

tree fn = build_fold_addr_expr (builtin_decl_explicit (BUILT_IN_STRLEN));
gimple *fn_call = gimple_build_call (fn, 1, mem);

However, since it is also used that way in the context of
generate_memset_builtin I didn't remove it so far.

I think overall the approach is sound now but the details still need work.


Once again thank you very much for your review.  Really appreciated!


The patch lacks a changelog entry / description.  It's nice if patches sent
out for review are basically the rev as git format-patch produces.

The rawmemchr optab needs documenting in md.texi


While writing the documentation in md.texi I realised that other
instructions expect an address to be a memory operand which is not the
case for rawmemchr currently. At the moment the address is either an
SSA_NAME or ADDR_EXPR with a tree pointer type in expand_RAWMEMCHR. As a
consequence in the backend define_expand rawmemchr<mode> expects a
register operand and not a memory operand. Would it make sense to build
a MEM_REF out of SSA_NAME/ADDR_EXPR in expand_RAWMEMCHR? Not sure if
MEM_REF is supposed to be the canonical form here.


I suppose the expander could use code similar to what
expand_builtin_memset_args does,
using get_memory_rtx.  I suppose that we're using MEM operands because those
can convey things like alias info or alignment info, something which
REG operands cannot
(easily).  I wouldn't build a MEM_REF and try to expand that.


The new patch contains the following changes:

- In expand_RAWMEMCHR I'm using get_memory_rtx now.  This means I had to
   change linkage of get_memory_rtx to extern.

- In function generate_strlen_builtin_using_rawmemchr I'm not
   reconstructing the load type anymore from the base pointer but rather
   pass it as a parameter from function transform_reduction_loop where we
   also ensured that it is of integral type.  Reconstructing the load
   type was error prone since e.g. I didn't distinct between
   pointer_plus_expr or addr_expr.  Thus passing the load type should be
   more solid.

Regtested on IBM Z and x86.  Ok for mainline?


OK, and sorry for all the repeated delays.

I'm running into PR56888 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888 ) on nvptx due tothis, f.i. in gcc/testsuite/gcc.c-torture/execute/builtins/strlen.c,where gcc/testsuite/gcc.c-torture/execute/builtins/lib/strlen.c containsa strlen function, with a strlen loop, which is transformed bypass_loop_distribution into a __builtin_strlen, which is then expandedinto a strlen call, creating a self-recursive function. [ And on nvptx,that happens to result in a compilation failure, which is how I foundthis. ]

According to this (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888#c21 ) comment:

...
-fno-tree-loop-distribute-patterns is the reliable way to not
transform loops into library calls.
...

Then should we have something along the lines of:
...
$ git diff
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 6fe59cd56855..9a211d30cd7e 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -3683,7 +3683,11 @@ loop_distribution::transform_reduction_loop
               && TYPE_PRECISION (ptr_type_node) >= 32)
              || (TYPE_OVERFLOW_UNDEFINED (reduction_var_type)

&& TYPE_PRECISION (reduction_var_type) <=TYPE_PRECISION (sizetype)))

-         && builtin_decl_implicit (BUILT_IN_STRLEN))
+         && builtin_decl_implicit (BUILT_IN_STRLEN)
+         && flag_tree_loop_distribute_patterns)
        generate_strlen_builtin (loop, reduction_var, load_iv.base,
                                 reduction_iv.base, loc);

else if (direct_optab_handler (rawmemchr_optab, TYPE_MODE(load_type))

...
?

Or is the comment no longer valid?

Thanks,
- Tom

Re: [RFC] ldist: Recognize rawmemchr loop patterns

Reply via email to