On Thu, Jan 19, 2023 at 12:25 PM Alexandre Oliva <ol...@adacore.com> wrote:
>
> On Jan 16, 2023, Richard Biener <richard.guent...@gmail.com> wrote:
>
> > On Sat, Jan 14, 2023 at 2:55 AM Alexandre Oliva <ol...@adacore.com> wrote:
> >> Target-specific code is great for tight optimizations, but the main
> >> purpose of this feature is not an optimization.  AFAICT it actually
> >> slows things down in general (due to code growth, and to conservative
> >> assumptions about alignment), except perhaps for some microbenchmarks.
> >> It's rather a means to avoid depending on the C runtime, particularly
> >> due to compiler-introduced memset calls.
>
> > OK, that's what I guessed but you didn't spell out.  So does it make sense
> > to mention -ffreestanding in the docs at least?  My fear is that we'd get
> > complaints that -O3 -finline-memset-loops turns nicely optimized memset
> > loops into dumb ones (via loop distribution and then stupid re-expansion).
> > So does it also make sense to turn off -floop-distribute-patterns[-memset]
> > with -finline-memset-loops?
>
> I don't think they should be tied together.  Verbose as it is, the
> expansion of memset is a sort of local optimum given what the compiler
> knows about length and alignment: it minimizes what can be minimized,
> namely the compare count, by grouping stores in large straight-line
> blocks.
>
> Though an optimized memset could in theory perform better, whether
> through ifuncs or by bumping alignment, if you're tuning generated code
> for a specific target, and you know dest is aligned, the generated code
> can likely beat a general-purpose optimized memset, even if by a thin
> margin: it saves, for instance, the code that a general-purpose memset
> would have to run to detect the alignment and realize it doesn't need
> to be bumped, and to extend the byte value to be stored to wider modes.
>
> So I can envision cases in which -floop-distribute-patterns could turn
> highly inefficient stores into a memset with known-strict alignment and
> length multiplier, which could then be profitably expanded inline so as
> to take advantage of both, for performance reasons.
>
> Indeed, when I started working on this, I thought the issue was
> performance, and this led me to pursue the store-by-multiple-pieces
> logic.  It can indeed bring about performance improvements, both over
> generic-target and highly-optimized memset implementations.  But it can
> also be put to use to avoid C runtime calls.  So while I wouldn't
> suggest enabling it by default at any optimization level, I wouldn't tie
> it to the single purpose of freestanding environments either.
>
> >> My initial goal was to be able to show that inline expansion would NOT
> >> bring about performance improvements, but performance was not the
> >> concern that led to the request.
> >>
> >> If the approach seems generally acceptable, I may even end up extending
> >> it to other such builtins.  I have a vague recollection that memcmp is
> >> also an issue for us.
>
> > The C/C++ runtime produces at least memmove, memcpy and memcmp as well.
>
> *nod*.  The others are far more challenging to expand inline in a way
> that could potentially be more performant:
>
> - memcmp can only do by_pieces when testing for equality, presumably
> because grouping multiple bytes into larger words to speed things up
> won't always get you the expected result if you just subtract the larger
> words: endianness reversal prior to subtracting might be required, which
> would harm performance (see the first sketch below).  I don't see that
> using similar power-of-two-sized grouping strategies to minimize looping
> overheads would be very advantageous, if at all, given the need for
> testing and branching at every word.
>
> - memcpy seems doable, but all of the block moves other than cpymem
> assume a non-overlapping memcpy.  Even if we were to output a test for
> overlap that a naïve expansion would break, and an alternate block to go
> backwards, all of the block-copying logic would have to be "oriented" to
> proceed explicitly forward, backward, or don't-care, where currently we
> only have don't-care (see the second sketch below).
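
To make the endianness point concrete, here is a minimal C sketch (not
from any patch; the helper name and the 32-bit word size are made up) of
what a word-at-a-time memcmp that supports ordering would have to do:

#include <stdint.h>
#include <string.h>

/* Compare N bytes a word at a time.  memcmp order is the order of the
   first differing byte, which a little-endian word load puts in the
   *low* bits, so the words must be byte-swapped before an unsigned
   comparison yields the right sign.  */
int
wordwise_memcmp (const void *a, const void *b, size_t n)
{
  const unsigned char *pa = a, *pb = b;
  while (n >= sizeof (uint32_t))
    {
      uint32_t wa, wb;
      memcpy (&wa, pa, sizeof wa);
      memcpy (&wb, pb, sizeof wb);
      if (wa != wb)
        {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
          wa = __builtin_bswap32 (wa);  /* the extra work in question */
          wb = __builtin_bswap32 (wb);
#endif
          return wa < wb ? -1 : 1;
        }
      pa += sizeof wa;
      pb += sizeof wb;
      n -= sizeof wa;
    }
  while (n--)  /* byte-sized tail */
    {
      if (*pa != *pb)
        return *pa < *pb ? -1 : 1;
      pa++, pb++;
    }
  return 0;
}

The byte swaps sit on the hot path; that is exactly the cost the current
equality-only by_pieces restriction avoids.
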
> Though my initial plan, when posting this patch, was to see how well the
> general approach was received before thinking much about how to apply
> it to the other builtins, now I am concerned that extending it to them
> is more than I can tackle.
>
> Would it make more sense to extend it, even constrained by the
> limitations mentioned above, or to handle memset only?  In the latter
> case, would it still make sense to adopt a command-line option that
> suggests a broader effect than it already has, even if that's only a
> hopeful future extension?  -finline-all-stringops[={memset,memcpy,...}],
> which you suggested, seems to be a reasonable and extensible one to
> adopt.
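
As for the orientation problem, this C sketch (names invented; a real
expansion would emit block moves rather than byte loops) shows the
run-time direction choice an overlap-tolerant memcpy expansion would
need:

#include <stddef.h>
#include <stdint.h>

/* Copy N bytes from SRC to DST, tolerating overlap.  The casts to
   uintptr_t sidestep the formal undefinedness of ordering pointers
   into different objects.  */
void *
overlap_safe_copy (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  if ((uintptr_t) d <= (uintptr_t) s
      || (uintptr_t) d >= (uintptr_t) s + n)
    /* DST starts at or before SRC, or past its end: forward is safe.  */
    for (size_t i = 0; i < n; i++)
      d[i] = s[i];
  else
    /* DST starts inside SRC: copying forward would clobber bytes not
       yet read, so proceed backwards.  */
    for (size_t i = n; i-- > 0; )
      d[i] = s[i];
  return dst;
}

Everything in the else branch would have to be built from explicitly
backward-oriented block moves, which is the infrastructure that is
currently missing.
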
Well, if the main intent is to avoid relying on a C runtime for calls
generated by the compiler, then yes!  Otherwise it would be incomplete.
In that light ...

> >> Is (optionally) tending to this (uncommon, I suppose) need (or
> >> preference?) not something GCC would like to do?
>
> > Sure, I think for the specific intended purpose that would be fine.
>
> Cool!
>
> > It should also only apply to __builtin_memset calls, not to memset
> > calls from user code?
>
> I suppose it could be argued both ways.  The situations that I had in
> mind either already are or could be made __builtin_memset calls, but I
> can't think of reasons to prevent explicit memset calls from such
> expansion.  Do you have any specific purpose in mind?

... when the user writes memcpy () then he's supposed to provide a
definition, even with -ffreestanding.  But one could argue that with
-ffreestanding the compiler is responsible for providing memcpy () if
it is itself introducing calls to it.

> In favor of allowing non-__builtin_ memset to be so expanded, I'll
> mention that I caught a number of corner cases while developing the
> following improved version of the earlier patch (which now handles even
> large constant lengths) by bootstrapping the compiler with the option
> enabled by default.  Without that, testcases would have to be a *lot*
> more thorough.

I can see that; the question is really what the option is targeted at.
For *-linux at least I can't see a reason to use it - performance cannot
be it, because glibc is decent.  So the question really boils down to
the intended audience.  If it's possibly extended to cover strcpy, then
handling user calls to these functions might make sense for
-ffreestanding, but where do we stop?

Richard.

> Introduce -finline-memset-loops
>
> From: Alexandre Oliva <ol...@adacore.com>
>
> try_store_by_multiple_pieces was added not long ago, enabling a
> variable-sized memset to be expanded inline when its worst-case
> in-range constant length would be, using conditional blocks with powers
> of two to cover all possibilities of length and alignment.
>
> This patch extends the memset expansion to start with a loop, so as to
> still take advantage of known alignment even with long lengths, but
> without necessarily adding store blocks for every power of two.
>
> This makes it possible for any memset call to be expanded, even if it
> stores only a single byte per iteration.  Surely efficient
> implementations of memset can do better, with a pre-loop to increase
> alignment, but that would likely be excessive for inline expansions of
> memset.
>
> Still, in some cases, users prefer to inline memset, even if it's not
> as performant, or when it's known to be performant in ways the
> compiler can't tell, to avoid depending on a C runtime library.
>
> With this flag, inline expansion of memset can be selectively
> requested, globally or per function, while the underlying
> infrastructure may enable us to introduce per-target tuning that turns
> on such looping wherever it is advantageous, even when not explicitly
> requested.
>
>
> for gcc/ChangeLog
>
>         * builtins.cc (try_store_by_multiple_pieces): Support starting
>         with a loop.
>         * common.opt (finline-memset-loops): New.
>         * doc/invoke.texi (-finline-memset-loops): Add.
>
> for gcc/testsuite/ChangeLog
>
>         * gcc.dg/torture/inline-mem-set-1.c: New.
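
As a reading aid before the diff, the code shape described above might
be rendered in C roughly as follows (illustrative only: the block sizes,
the zero fill value and the function name are arbitrary, and the real
transformation emits RTL, not source):

/* What -finline-memset-loops conceptually turns memset (p, 0, n) into
   when N is variable: a loop over the largest block, then one
   conditional straight-line block per smaller power of two, so the
   only per-chunk overhead is a compare-and-branch.  */
void
inline_memset_shape (char *p, __SIZE_TYPE__ n)
{
  while (n >= 32)  /* the loop this patch introduces */
    {
      __builtin_memset (p, 0, 32);  /* expanded as plain stores */
      p += 32;
      n -= 32;
    }
  if (n & 16) { __builtin_memset (p, 0, 16); p += 16; }
  if (n & 8) { __builtin_memset (p, 0, 8); p += 8; }
  if (n & 4) { __builtin_memset (p, 0, 4); p += 4; }
  if (n & 2) { __builtin_memset (p, 0, 2); p += 2; }
  if (n & 1) __builtin_memset (p, 0, 1);
}
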
> ---
>  gcc/builtins.cc                                 |   99 +++++++++++++++++++++--
>  gcc/common.opt                                  |    4 +
>  gcc/doc/invoke.texi                             |   13 +++
>  gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c |   84 ++++++++++++++++++++
>  4 files changed, 192 insertions(+), 8 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
>
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index af45829875e6c..733fe17ede622 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -4322,6 +4322,10 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>    int tst_bits = (max_bits != min_bits ? max_bits
>                    : floor_log2 (max_len ^ min_len));
>
> +  /* Save the pre-blksize values.  */
> +  int orig_max_bits = max_bits;
> +  int orig_tst_bits = tst_bits;
> +
>    /* Check whether it's profitable to start by storing a fixed BLKSIZE
>       bytes, to lower max_bits.  In the unlikely case of a constant LEN
>       (implied by identical MAX_LEN and MIN_LEN), we want to issue a
> @@ -4361,9 +4365,70 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>    if (max_bits >= 0)
>      xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
>                  - (HOST_WIDE_INT_1U << ctz_len));
> -  if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
> -                            &valc, align, true))
> -    return false;
> +  bool max_loop = false;
> +  /* Skip the test in case of overflow in xlenest.  It shouldn't
> +     happen because of the way max_bits and blksize are related, but
> +     it doesn't hurt to test.  */
> +  if (blksize > xlenest
> +      || !can_store_by_pieces (xlenest, builtin_memset_read_str,
> +                               &valc, align, true))
> +    {
> +      if (!flag_inline_memset_loops)
> +        return false;
> +
> +      for (max_bits = orig_max_bits;
> +           max_bits >= sctz_len;
> +           --max_bits)
> +        {
> +          xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2
> +                     - (HOST_WIDE_INT_1U << ctz_len));
> +          /* Check that blksize plus the bits to be stored as blocks
> +             sized at powers of two can be stored by pieces.  This is
> +             like the test above, but with smaller max_bits.  Skip
> +             orig_max_bits (it would be redundant).  Also skip in case
> +             of overflow.  */
> +          if (max_bits < orig_max_bits
> +              && xlenest + blksize >= xlenest
> +              && can_store_by_pieces (xlenest + blksize,
> +                                      builtin_memset_read_str,
> +                                      &valc, align, true))
> +            {
> +              max_loop = true;
> +              break;
> +            }
> +          if (blksize
> +              && can_store_by_pieces (xlenest,
> +                                      builtin_memset_read_str,
> +                                      &valc, align, true))
> +            {
> +              max_len += blksize;
> +              min_len += blksize;
> +              tst_bits = orig_tst_bits;
> +              blksize = 0;
> +              max_loop = true;
> +              break;
> +            }
> +          if (max_bits == sctz_len)
> +            {
> +              --sctz_len;
> +              --ctz_len;
> +            }
> +        }
> +      if (!max_loop)
> +        return false;
> +      /* If the boundaries are such that min and max may run a
> +         different number of trips in the initial loop, the remainder
> +         need not be between the moduli, so set tst_bits to cover all
> +         bits.  Otherwise, if the trip counts are the same, max_len
> +         has the common prefix, and the previously-computed tst_bits
> +         is usable.  */
> +      if (max_len >> max_bits > min_len >> max_bits)
> +        tst_bits = max_bits;
> +    }
> +  /* ??? Do we have to check that all powers of two lengths from
> +     max_bits down to ctz_len pass can_store_by_pieces?  As in, could
> +     it possibly be that xlenest passes while smaller power-of-two
> +     sizes don't?  */
>
>    by_pieces_constfn constfun;
>    void *constfundata;
> @@ -4405,7 +4470,9 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>       the least significant bit possibly set in the length.  */
>    for (int i = max_bits; i >= sctz_len; i--)
>      {
> +      rtx_code_label *loop_label = NULL;
>        rtx_code_label *label = NULL;
> +
>        blksize = HOST_WIDE_INT_1U << i;
>
>        /* If we're past the bits shared between min_ and max_len, expand
> @@ -4419,18 +4486,31 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>                                     profile_probability::even ());
>        }
>        /* If we are at a bit that is in the prefix shared by min_ and
> -         max_len, skip this BLKSIZE if the bit is clear.  */
> -      else if ((max_len & blksize) == 0)
> +         max_len, skip the current BLKSIZE if the bit is clear, but do
> +         not skip the loop, even if it doesn't require
> +         prechecking.  */
> +      else if ((max_len & blksize) == 0
> +               && !(max_loop && i == max_bits))
>        continue;
>
> +      if (max_loop && i == max_bits)
> +        {
> +          loop_label = gen_label_rtx ();
> +          emit_label (loop_label);
> +          /* Since we may run this multiple times, don't assume we
> +             know anything about the offset.  */
> +          clear_mem_offset (to);
> +        }
> +
>        /* Issue a store of BLKSIZE bytes.  */
> +      bool update_needed = i != sctz_len || loop_label;
>        to = store_by_pieces (to, blksize,
>                              constfun, constfundata,
>                              align, true,
> -                            i != sctz_len ? RETURN_END : RETURN_BEGIN);
> +                            update_needed ? RETURN_END : RETURN_BEGIN);
>
>        /* Adjust REM and PTR, unless this is the last iteration.  */
> -      if (i != sctz_len)
> +      if (update_needed)
>        {
>          emit_move_insn (ptr, force_operand (XEXP (to, 0), NULL_RTX));
>          to = replace_equiv_address (to, ptr);
> @@ -4438,6 +4518,11 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>          emit_move_insn (rem, force_operand (rem_minus_blksize, NULL_RTX));
>        }
>
> +      if (loop_label)
> +        emit_cmp_and_jump_insns (rem, GEN_INT (blksize), GE, NULL,
> +                                 ptr_mode, 1, loop_label,
> +                                 profile_probability::likely ());
> +
>        if (label)
>        {
>          emit_label (label);
> diff --git a/gcc/common.opt b/gcc/common.opt
> index d0371aec8db5f..5d28f054241ad 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1874,6 +1874,10 @@ finline-atomics
>  Common Var(flag_inline_atomics) Init(1) Optimization
>  Inline __atomic operations when a lock free instruction sequence is available.
>
> +finline-memset-loops
> +Common Var(flag_inline_memset_loops) Init(0) Optimization
> +Inline memset even if it requires loops.
> +
>  fcf-protection
>  Common RejectNegative Alias(fcf-protection=,full)
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 631c00582bf85..8f8d52bbeef68 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -548,7 +548,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fgcse-sm -fhoist-adjacent-loads -fif-conversion @gol
>  -fif-conversion2 -findirect-inlining @gol
>  -finline-functions -finline-functions-called-once -finline-limit=@var{n} @gol
> --finline-small-functions -fipa-modref -fipa-cp -fipa-cp-clone @gol
> +-finline-memset-loops -finline-small-functions @gol
> +-fipa-modref -fipa-cp -fipa-cp-clone @gol
>  -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const @gol
>  -fipa-reference -fipa-reference-addressable @gol
>  -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm} @gol
> @@ -11961,6 +11962,16 @@ in its own right.
>  Enabled at levels @option{-O1}, @option{-O2}, @option{-O3} and @option{-Os},
>  but not @option{-Og}.
>
> +@item -finline-memset-loops
> +@opindex finline-memset-loops
> +Expand @code{memset} calls inline, even when the length is variable or
> +large enough to require looping.  This may enable the compiler to take
> +advantage of known alignment and length multipliers, but it will often
> +generate code that is less efficient than performant implementations of
> +@code{memset}, and grow code size so much that even a less performant
> +@code{memset} may run faster due to better use of the code cache.  This
> +option is disabled by default.
> +
>  @item -fearly-inlining
>  @opindex fearly-inlining
>  Inline functions marked by @code{always_inline} and functions whose body seems
> diff --git a/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
> new file mode 100644
> index 0000000000000..4de51df006ede
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
> @@ -0,0 +1,84 @@
> +/* { dg-do compile } */
> +/* { dg-options "-finline-memset-loops -gno-record-gcc-switches -fno-lto" } */
> +
> +void *zero (unsigned long long (*p)[32], int n)
> +{
> +  return __builtin_memset (p, 0, n * sizeof (*p));
> +}
> +
> +void *ones (char (*p)[128], int n)
> +{
> +  return __builtin_memset (p, -1, n * sizeof (*p));
> +}
> +
> +void *opt2 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 2) * sizeof (*p));
> +}
> +
> +void *opt8 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 8) * sizeof (*p));
> +}
> +
> +void *opt32 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 32) * sizeof (*p));
> +}
> +
> +void *opt128 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 128) * sizeof (*p));
> +}
> +
> +void *opt512 (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 1024 : 512) * sizeof (*p));
> +}
> +
> +void *opt_primes (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 509 : 7) * sizeof (*p));
> +}
> +
> +void *opt_primes_blk (int *p, int i)
> +{
> +  return __builtin_memset (p, 0, (i ? 521 : 9) * sizeof (*p));
> +}
> +
> +void *huge (long (*p)[16384])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep1 (long (*p)[16384+1])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep4 (long (*p)[16384+4])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep16 (long (*p)[16384+16])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep64 (long (*p)[16384+64])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep256 (long (*p)[16384+256])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +void *hugep1024p256p64p16p4p1 (long (*p)[16384+1024+64+16+4+1])
> +{
> +  return __builtin_memset (p, 0, sizeof (*p));
> +}
> +
> +/* { dg-final { scan-assembler-not "memset" } } */
>
>
> --
> Alexandre Oliva, happy hacker            https://FSFLA.org/blogs/lxo/
>    Free Software Activist                   GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about <https://stallmansupport.org>
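
For anyone wanting to try the patch, a minimal usage sketch (the file
name and command line are illustrative, not from this thread):

/* zero.c -- with the patch applied, compiling as
     gcc -O2 -ffreestanding -finline-memset-loops -S zero.c
   should leave no memset call in the assembly, only the inline
   loop-plus-blocks expansion, even though N is unknown.  */
void
clear_buffer (char *buf, unsigned long n)
{
  __builtin_memset (buf, 0, n);
}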