Re: Introduce -finline-stringops (was: Re: [RFC] Introduce -finline-memset-loops)

2023-12-11 Thread Sam James


Alexandre Oliva via Gcc-patches  writes:

> On Jun  2, 2023, Alexandre Oliva  wrote:
>
>> Introduce -finline-stringops
>
> Ping?  https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620472.html

Should the docs for the x86-specific -minline-all-stringops refer to
the new -finline-stringops?

thanks,
sam


Introduce -finline-stringops (was: Re: [RFC] Introduce -finline-memset-loops)

2023-06-22 Thread Alexandre Oliva via Gcc-patches
On Jun  2, 2023, Alexandre Oliva  wrote:

> Introduce -finline-stringops

Ping?  https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620472.html

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about 


Re: [RFC] Introduce -finline-memset-loops

2023-06-02 Thread Fangrui Song via Gcc-patches
On Fri, Jun 2, 2023 at 3:11 AM Alexandre Oliva via Gcc-patches
 wrote:
>
> On Jan 19, 2023, Alexandre Oliva  wrote:
>
> > Would it make more sense to extend it, even constrained by the
> > limitations mentioned above, or handle memset only?  In the latter case,
> > would it still make sense to adopt a command-line option that suggests a
> > broader effect than it already has, even if it's only a hopeful future
> > extension?  -finline-all-stringops[={memset,memcpy,...}], that you
> > suggested, seems to be a reasonable and extensible one to adopt.
>
> I ended up implementing all of memset, memcpy, memmove, and memcmp:
>
> Introduce -finline-stringops
>
> try_store_by_multiple_pieces was added not long ago, enabling
> variable-sized memset to be expanded inline when the worst-case
> in-range constant length would, using conditional blocks with powers
> of two to cover all possibilities of length and alignment.
>
> This patch introduces -finline-stringops[=fn] to request expansions to
> start with a loop, so as to still take advantage of known alignment
> even with long lengths, but without necessarily adding store blocks
> for every power of two.
>
> This makes it possible for the supported stringops (memset, memcpy,
> memmove, memcmp) to be expanded, even if storing a single byte per
> iteration.  Surely efficient implementations can run faster, with a
> pre-loop to increase alignment, but that would likely be excessive for
> inline expansions.
>
> Still, in some cases, such as in freestanding environments, users
> prefer to inline such stringops, especially those that the compiler
> may introduce itself, even if the expansion is not as performant as a
> highly optimized C library implementation could be, to avoid
> depending on a C runtime library.
>
> Regstrapped on x86_64-linux-gnu, also bootstrapped with
> -finline-stringops enabled by default, and tested with arm, aarch64, 32-
> and 64-bit riscv with gcc-12.  Ok to install?
>[...]

This seems to be related to Clang's __builtin_mem{set,cpy}_inline.  I
just created
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110094 ("Support
__builtin_mem*_inline").


Re: [RFC] Introduce -finline-memset-loops

2023-06-02 Thread Alexandre Oliva via Gcc-patches
On Jan 19, 2023, Alexandre Oliva  wrote:

> Would it make more sense to extend it, even constrained by the
> limitations mentioned above, or handle memset only?  In the latter case,
> would it still make sense to adopt a command-line option that suggests a
> broader effect than it already has, even if it's only a hopeful future
> extension?  -finline-all-stringops[={memset,memcpy,...}], that you
> suggested, seems to be a reasonable and extensible one to adopt.

I ended up implementing all of memset, memcpy, memmove, and memcmp:

Introduce -finline-stringops

try_store_by_multiple_pieces was added not long ago, enabling
variable-sized memset to be expanded inline when the worst-case
in-range constant length would, using conditional blocks with powers
of two to cover all possibilities of length and alignment.
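
(As a rough, made-up sketch of that pre-existing scheme, writing cims(d, c, n)
for a constant-size inline store of n bytes of value c: a variable-length
memset whose length is known to be at most 127 might expand to

  if (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
  if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
  if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
  if (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
  if (len >= 4) { len -= 4; cims(dest, c, 4); dest += 4; }
  if (len >= 2) { len -= 2; cims(dest, c, 2); dest += 2; }
  if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

with one conditional block per power of two, which is what stops being
practical once the worst-case length is large or unbounded.)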

This patch introduces -finline-stringops[=fn] to request expansions to
start with a loop, so as to still take advantage of known alignment
even with long lengths, but without necessarily adding store blocks
for every power of two.

This makes it possible for the supported stringops (memset, memcpy,
memmove, memcmp) to be expanded, even if storing a single byte per
iteration.  Surely efficient implementations can run faster, with a
pre-loop to increase alignment, but that would likely be excessive for
inline expansions.

Still, in some cases, such as in freestanding environments, users
prefer to inline such stringops, especially those that the compiler
may introduce itself, even if the expansion is not as performant as a
highly optimized C library implementation could be, to avoid
depending on a C runtime library.
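
(For instance, a hypothetical freestanding build -- file and function names
made up for illustration:

  /* clear.c */
  void clear (char *p, unsigned long n) { __builtin_memset (p, 0, n); }

  gcc -O2 -ffreestanding -finline-stringops -c clear.c

would be expected to expand the memset inline rather than emit a call into
the C library.)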

Regstrapped on x86_64-linux-gnu, also bootstrapped with
-finline-stringops enabled by default, and tested with arm, aarch64, 32-
and 64-bit riscv with gcc-12.  Ok to install?


for  gcc/ChangeLog

* expr.cc (emit_block_move_hints): Take ctz of len.  Obey
-finline-stringops.  Use oriented or sized loop.
(emit_block_move): Take ctz of len, and pass it on.
(emit_block_move_via_sized_loop): New.
(emit_block_move_via_oriented_loop): New.
(emit_block_move_via_loop): Take incr.  Move an incr-sized
block per iteration.
(emit_block_cmp_via_cmpmem): Take ctz of len.  Obey
-finline-stringops.
(emit_block_cmp_via_loop): New.
* expr.h (emit_block_move): Add ctz of len defaulting to zero.
(emit_block_move_hints): Likewise.
(emit_block_cmp_hints): Likewise.
* builtins.cc (expand_builtin_memory_copy_args): Pass ctz of
len to emit_block_move_hints.
(try_store_by_multiple_pieces): Support starting with a loop.
(expand_builtin_memcmp): Pass ctz of len to
emit_block_cmp_hints.
(expand_builtin): Allow inline expansion of memset, memcpy,
memmove and memcmp if requested.
* common.opt (finline-stringops): New.
(ilsop_fn): New enum.
* flag-types.h (enum ilsop_fn): New.
* doc/invoke.texi (-finline-stringops): Add.

for  gcc/testsuite/ChangeLog

* gcc.dg/torture/inline-mem-cmp-1.c: New.
* gcc.dg/torture/inline-mem-cpy-1.c: New.
* gcc.dg/torture/inline-mem-cpy-cmp-1.c: New.
* gcc.dg/torture/inline-mem-move-1.c: New.
* gcc.dg/torture/inline-mem-set-1.c: New.
---
 gcc/builtins.cc                                    |  114 ++
 gcc/common.opt                                     |   34 ++
 gcc/doc/invoke.texi                                |   15 +
 gcc/expr.cc                                        |  374 +++-
 gcc/expr.h                                         |    9 
 gcc/flag-types.h                                   |   11 +
 gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c    |    7 
 gcc/testsuite/gcc.dg/torture/inline-mem-cpy-1.c    |    8 
 .../gcc.dg/torture/inline-mem-cpy-cmp-1.c          |   11 +
 gcc/testsuite/gcc.dg/torture/inline-mem-move-1.c   |    9 
 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c    |   84 
 11 files changed, 646 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-cmp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-cpy-1.c
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-cpy-cmp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-move-1.c
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 8400adaf5b4db..1beaa4eae97a5 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -3769,7 +3769,7 @@ expand_builtin_memory_copy_args (tree dest, tree src, tree len,
 expected_align, expected_size,
 min_size, max_size, probable_max_size,
 use_mempcpy_call, _move_done,
-might_overlap);
+might_overlap, tree_ctz (len));
 
   /* Bail 

Re: [RFC] Introduce -finline-memset-loops

2023-01-19 Thread Richard Biener via Gcc-patches
On Thu, Jan 19, 2023 at 12:25 PM Alexandre Oliva  wrote:
>
> On Jan 16, 2023, Richard Biener  wrote:
>
> > On Sat, Jan 14, 2023 at 2:55 AM Alexandre Oliva  wrote:
> >> Target-specific code is great for tight optimizations, but the main
> >> purpose of this feature is not an optimization.  AFAICT it actually
> >> slows things down in general (due to code growth, and to conservative
> >> assumptions about alignment), except perhaps for some microbenchmarks.
> >> It's rather a means to avoid depending on the C runtime, particularly
> >> due to compiler-introduced memset calls.
>
> > OK, that's what I guessed but you didn't spell out.  So does it make sense
> > to mention -ffreestanding in the docs at least?  My fear is that we'd get
> > complaints that -O3 -finline-memset-loops turns nicely optimized memset
> > loops into dumb ones (via loop distribution and then stupid re-expansion).
> > So does it also make sense to turn off -floop-distribute-patterns[-memset]
> > with -finline-memset-loops?
>
> I don't think they should be tied together.  Verbose as it is, the
> expansion of memset is a sort of local optimum given what the compiler
> knows about length and alignment: minimizing what can be minimized,
> namely the compare count, by grouping stores in large straight-line
> blocks.
>
> Though an optimized memset could in theory perform better, whether
> through ifuncs or by bumping alignment, if you're tuning generated code
> for a specific target, and you know dest is aligned, the generated code
> can likely beat a general-purpose optimized memset, even if by a thin
> margin, such as the code that the general-purpose memset would have to
> run to detect the alignment and realize it doesn't need to be bumped,
> and to extend the byte value to be stored to wider modes.
>
> So I can envision cases in which -floop-distribute-patterns could turn
> highly inefficient stores into a memset with known-strict alignment and
> length multiplier, that could then be profitably expanded inline so as
> to take advantage of both for performance reasons.
>
> Indeed, when I started working on this, I thought the issue was
> performance, and this led me to pursue the store-by-multiple-pieces
> logic.  It can indeed bring about performance improvements, both over
> generic-target and highly-optimized memset implementations.  But it can
> also be put to use to avoid C runtime calls.  So while I wouldn't
> suggest enabling it by default at any optimization level, I wouldn't tie
> it to the single purpose of freestanding environments either.
>
>
> >> My initial goal was to be able to show that inline expansion would NOT
> >> bring about performance improvements, but performance was not the
> >> concern that led to the request.
> >>
> >> If the approach seems generally acceptable, I may even end up extending
> >> it to other such builtins.  I have a vague recollection that memcmp is
> >> also an issue for us.
>
> > The C/C++ runtime produce at least memmove, memcpy and memcmp as well.
>
> *nod*.  The others are far more challenging to expand inline in a way
> that could potentially be more performant:
>
> - memcmp can only do by_pieces when testing for equality, presumably
> because grouping multiple bytes into larger words to speed things up
> won't always get you the expected result if you just subtract the larger
> words, endianness reversal prior to subtracting might be required, which
> would harm performance.  I don't see that using similar
> power-of-two-sizes grouping strategies to minimize looping overheads
> would be so advantageous, if at all, given the need for testing and
> branching at every word.
>
> - memcpy seems doable, but all of the block moves other than cpymem
> assume non-overlapping memcpy.  Even if we were to output a test for
> overlap that a naïve expansion would break, and an alternate block to go
> backwards, all of the block copying logic would have to be "oriented" to
> proceed explicitly forward, backward, or don't-care, where currently we
> only have don't-care.
>
> Though my initial plan, when posting this patch, was to see how well the
> general approach was received, before thinking much about how to apply
> it to the other builtins, now I am concerned that extending it to them
> is more than I can tackle.
>
> Would it make more sense to extend it, even constrained by the
> limitations mentioned above, or handle memset only?  In the latter case,
> would it still make sense to adopt a command-line option that suggests a
> broader effect than it already has, even if it's only a hopeful future
> extension?  -finline-all-stringops[={memset,memcpy,...}], that you
> suggested, seems to be a reasonable and extensible one to adopt.

Well, if the main intent is to avoid relying on a C runtime for calls
generated by the compiler then yes!  Otherwise it would be incomplete.
In that light ...

> >> Is (optionally) tending to this (uncommon, I suppose) need (or
> >> preference?) not something GCC would like 

Re: [RFC] Introduce -finline-memset-loops

2023-01-19 Thread Alexandre Oliva via Gcc-patches
On Jan 16, 2023, Richard Biener  wrote:

> On Sat, Jan 14, 2023 at 2:55 AM Alexandre Oliva  wrote:
>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment), except perhaps for some microbenchmarks.
>> It's rather a means to avoid depending on the C runtime, particularly
>> due to compiler-introduced memset calls.

> OK, that's what I guessed but you didn't spell out.  So does it make sense
> to mention -ffreestanding in the docs at least?  My fear is that we'd get
> complaints that -O3 -finline-memset-loops turns nicely optimized memset
> loops into dumb ones (via loop distribution and then stupid re-expansion).
> So does it also make sense to turn off -floop-distribute-patterns[-memset]
> with -finline-memset-loops?

I don't think they should be tied together.  Verbose as it is, the
expansion of memset is a sort of local optimum given what the compiler
knows about length and alignment: minimizing what can be minimized,
namely the compare count, by grouping stores in large straight-line
blocks.

Though an optimized memset could in theory perform better, whether
through ifuncs or by bumping alignment, if you're tuning generated code
for a specific target, and you know dest is aligned, the generated code
can likely beat a general-purpose optimized memset, even if by a thin
margin, such as the code that the general-purpose memset would have to
run to detect the alignment and realize it doesn't need to be bumped,
and to extend the byte value to be stored to wider modes.
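
(For example -- a sketch of that byte-widening step a library memset has to
perform at run time before it can issue word-sized stores:

  uint64_t v = (uint64_t) (unsigned char) c * 0x0101010101010101ULL;  /* uint64_t from <stdint.h> */

whereas an inline expansion with a known constant c gets the widened value
for free at compile time.)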

So I can envision cases in which -floop-distribute-patterns could turn
highly inefficient stores into a memset with known-strict alignment and
length multiplier, that could then be profitably expanded inline so as
to take advantage of both for performance reasons.

Indeed, when I started working on this, I thought the issue was
performance, and this led me to pursue the store-by-multiple-pieces
logic.  It can indeed bring about performance improvements, both over
generic-target and highly-optimized memset implementations.  But it can
also be put to use to avoid C runtime calls.  So while I wouldn't
suggest enabling it by default at any optimization level, I wouldn't tie
it to the single purpose of freestanding environments either.


>> My initial goal was to be able to show that inline expansion would NOT
>> bring about performance improvements, but performance was not the
>> concern that led to the request.
>> 
>> If the approach seems generally acceptable, I may even end up extending
>> it to other such builtins.  I have a vague recollection that memcmp is
>> also an issue for us.

> The C/C++ runtime produce at least memmove, memcpy and memcmp as well.

*nod*.  The others are far more challenging to expand inline in a way
that could potentially be more performant:

- memcmp can only do by_pieces when testing for equality, presumably
because grouping multiple bytes into larger words to speed things up
won't always get you the expected result if you just subtract the larger
words, endianness reversal prior to subtracting might be required, which
would harm performance.  I don't see that using similar
power-of-two-sizes grouping strategies to minimize looping overheads
would be so advantageous, if at all, given the need for testing and
branching at every word.

- memcpy seems doable, but all of the block moves other than cpymem
assume non-overlapping memcpy.  Even if we were to output a test for
overlap that a naïve expansion would break, and an alternate block to go
backwards, all of the block copying logic would have to be "oriented" to
proceed explicitly forward, backward, or don't-care, where currently we
only have don't-care.
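
(Two made-up sketches to illustrate the points above.  For memcmp on a
little-endian target, word-wise subtraction can get the ordering wrong even
though it still decides equality correctly:

  unsigned char a[2] = { 0x01, 0x02 };   /* memcmp (a, b, 2) < 0 */
  unsigned char b[2] = { 0x02, 0x01 };
  uint16_t wa = 0x0201, wb = 0x0102;     /* little-endian word loads; uint16_t from <stdint.h> */
  /* wa > wb: the sign is wrong without a byte swap, but wa != wb is still right,
     which is why the equality-only case can be handled by_pieces.  */

And an "oriented" expansion safe for overlap would conceptually have to pick a
direction, roughly:

  void move (char *d, const char *s, unsigned long len)
  {
    if (d <= s || d >= s + len)
      while (len--) *d++ = *s++;         /* forward copy is safe */
    else
      {
        d += len; s += len;
        while (len--) *--d = *--s;       /* overlap with d above s: copy backward */
      }
  }

whereas the current block-move logic only knows the "don't-care" form.)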

Though my initial plan, when posting this patch, was to see how well the
general approach was received, before thinking much about how to apply
it to the other builtins, now I am concerned that extending it to them
is more than I can tackle.

Would it make more sense to extend it, even constrained by the
limitations mentioned above, or handle memset only?  In the latter case,
would it still make sense to adopt a command-line option that suggests a
broader effect than it already has, even if it's only a hopeful future
extension?  -finline-all-stringops[={memset,memcpy,...}], that you
suggested, seems to be a reasonable and extensible one to adopt.

>> Is (optionally) tending to this (uncommon, I suppose) need (or
>> preference?) not something GCC would like to do?

> Sure, I think for the specific intended purpose that would be fine.

Cool!

> It should also only apply to __builtin_memset calls, not to memset
> calls from user code?

I suppose it could be argued both ways.  The situations that I had in
mind either already are or could be made __builtin_memset calls, but I
can't think of reasons to prevent explicit memset calls 

Re: [RFC] Introduce -finline-memset-loops

2023-01-15 Thread Richard Biener via Gcc-patches
On Sat, Jan 14, 2023 at 2:55 AM Alexandre Oliva  wrote:
>
> Hello, Richard,
>
> Thank you for the feedback.
>
> On Jan 12, 2023, Richard Biener  wrote:
>
> > On Tue, Dec 27, 2022 at 5:12 AM Alexandre Oliva via Gcc-patches
> >  wrote:
>
> >> This patch extends the memset expansion to start with a loop, so as to
> >> still take advantage of known alignment even with long lengths, but
> >> without necessarily adding store blocks for every power of two.
>
> > I wonder if that isn't better handled by targets via the setmem pattern,
>
> That was indeed where I started, but then I found myself duplicating the
> logic in try_store_by_multiple_pieces on a per-target basis.
>
> Target-specific code is great for tight optimizations, but the main
> purpose of this feature is not an optimization.  AFAICT it actually
> slows things down in general (due to code growth, and to conservative
> assumptions about alignment), except perhaps for some microbenchmarks.
> It's rather a means to avoid depending on the C runtime, particularly
> due to compiler-introduced memset calls.

OK, that's what I guessed but you didn't spell out.  So does it make sense
to mention -ffreestanding in the docs at least?  My fear is that we'd get
complaints that -O3 -finline-memset-loops turns nicely optimized memset
loops into dumb ones (via loop distribution and then stupid re-expansion).
So does it also make sense to turn off -floop-distribute-patterns[-memset]
with -finline-memset-loops?
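
For concreteness, the kind of loop in question would be something like this
hypothetical example:

  void zero (char *p, long n)
  {
    for (long i = 0; i < n; i++)
      p[i] = 0;
  }

which loop distribution (-ftree-loop-distribute-patterns, on at -O3)
recognizes and turns into a memset call, and which -finline-memset-loops
would then re-expand into a plainer loop.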

> My initial goal was to be able to show that inline expansion would NOT
> bring about performance improvements, but performance was not the
> concern that led to the request.
>
> If the approach seems generally acceptable, I may even end up extending
> it to other such builtins.  I have a vague recollection that memcmp is
> also an issue for us.

The C/C++ runtime produce at least memmove, memcpy and memcmp as well.
In this respect -finline-memset-loops is too specific, and to avoid an
explosion in the number of command-line options we should try to come up with something
better?  -finline-all-stringops[={memset,memcpy,...}] (just like x86 has
-minline-all-stringops)?

> > like x86 has the stringop inline strategy.  What is considered acceptable
> > in terms of size or performance will vary and I don't think there's much
> > room for improvements on this generic code support?
>
> *nod* x86 is quite finely tuned already; I suppose other targets may
> have some room for additional tuning, both for performance and for code
> size, but we don't have much affordance for avoiding builtin calls to
> the C runtime, which is what this is about.
>
> Sometimes disabling loop distribution is enough to accomplish that, but
> in some cases GNAT itself resorts to builtin memset calls, in ways that
> are not so easy to avoid, and that would ultimately amount to expanding
> memset inline, so I figured we might as well offer that as a general
> feature, for users to whom this matters.
>
> Is (optionally) tending to this (uncommon, I suppose) need (or
> preference?) not something GCC would like to do?

Sure, I think for the specific intended purpose that would be fine.  It should
also only apply to __builtin_memset calls, not to memset calls from user code?

Thanks,
Richard.

> --
> Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
>Free Software Activist   GNU Toolchain Engineer
> Disinformation flourishes because many people care deeply about injustice
> but very few check the facts.  Ask me about 


Re: [RFC] Introduce -finline-memset-loops

2023-01-13 Thread Alexandre Oliva via Gcc-patches
Hello, Paul,

On Jan 13, 2023, Paul Koning  wrote:

>> On Jan 13, 2023, at 8:54 PM, Alexandre Oliva via Gcc-patches
>>  wrote:

>> Target-specific code is great for tight optimizations, but the main
>> purpose of this feature is not an optimization.  AFAICT it actually
>> slows things down in general (due to code growth, and to conservative
>> assumptions about alignment), 

> I thought machinery like the memcpy patterns have as one of their
> benefits the ability to find the alignment of their operands and from
> that optimize things.  So I don't understand why you'd say
> "conservative".

Though memcpy implementations normally do that indeed, dynamically
increasing dest alignment has such an impact on code size that *inline*
memcpy doesn't normally do that.  try_store_by_multiple_pieces,
specifically, is potentially branch-heavy to begin with, and bumping
alignment up could double the inline expansion size.  So what it does is
to take the conservative dest alignment estimate from the compiler and
use it.

By adding leading loops to try_store_by_multiple_pieces (as does the
proposed patch, with its option enabled) we may expand an
unknown-length, unknown-alignment memset to something conceptually like
(cims is short for constant-sized inlined memset):

while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
if (len >= 32) { len -= 32; cims(dest, c, 32); dest += 32; }
if (len >= 16) { len -= 16; cims(dest, c, 16); dest += 16; }
if (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
if (len >= 4) { len -= 4; cims(dest, c, 4); dest += 4; }
if (len >= 2) { len -= 2; cims(dest, c, 2); dest += 2; }
if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

With dynamic alignment bumps under a trivial extension of the current
logic, it would become (cimsN is short for cims with dest known to be
aligned to an N-byte boundary):

if (len >= 2 && (dest & 1)) { len -= 1; cims(dest, c, 1); dest += 1; }
if (len >= 4 && (dest & 2)) { len -= 2; cims2(dest, c, 2); dest += 2; }
if (len >= 8 && (dest & 4)) { len -= 4; cims4(dest, c, 4); dest += 4; }
if (len >= 16 && (dest & 8)) { len -= 8; cims8(dest, c, 8); dest += 8; }
if (len >= 32 && (dest & 16)) { len -= 16; cims16(dest, c, 16); dest += 16; }
if (len >= 64 && (dest & 32)) { len -= 32; cims32(dest, c, 32); dest += 32; }
while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
if (len >= 32) { len -= 32; cims32(dest, c, 32); dest += 32; }
if (len >= 16) { len -= 16; cims16(dest, c, 16); dest += 16; }
if (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
if (len >= 4) { len -= 4; cims4(dest, c, 4); dest += 4; }
if (len >= 2) { len -= 2; cims2(dest, c, 2); dest += 2; }
if (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }


Now, by using more loops instead of going through every power of two, we
could shorten (for -Os) the former to e.g.:

while (len >= 64) { len -= 64; cims(dest, c, 64); dest += 64; }
while (len >= 8) { len -= 8; cims(dest, c, 8); dest += 8; }
while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }

and we could similarly add more compact logic for dynamic alignment:

if (len >= 8) {
  while (dest & 7) { len -= 1; cims(dest, c, 1); dest += 1; }
  if (len >= 64)
while (dest & 56) { len -= 8; cims8(dest, c, 8); dest += 8; }
  while (len >= 64) { len -= 64; cims64(dest, c, 64); dest += 64; }
  while (len >= 8) { len -= 8; cims8(dest, c, 8); dest += 8; }
}
while (len >= 1) { len -= 1; cims(dest, c, 1); dest += 1; }


Now, given that improving performance was never a goal of this change, and
the expansion it optionally offers is desirable even when it slows
things down, just making it a simple loop at the known alignment would
do.  The remainder sort of flowed out of the way
try_store_by_multiple_pieces was structured, and I found it sort of made
sense to start with the largest-reasonable block loop, and then end with
whatever try_store_by_multiple_pieces would have expanded a
known-shorter but variable length memset to.  And this is how I got to
it.  I'm not sure it makes any sense to try to change things further to
satisfy other competing goals such as performance or code size.

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about 


Re: [RFC] Introduce -finline-memset-loops

2023-01-13 Thread Paul Koning via Gcc-patches



> On Jan 13, 2023, at 8:54 PM, Alexandre Oliva via Gcc-patches 
>  wrote:
> 
> Hello, Richard,
> 
> Thank you for the feedback.
> 
> On Jan 12, 2023, Richard Biener  wrote:
> 
>> On Tue, Dec 27, 2022 at 5:12 AM Alexandre Oliva via Gcc-patches
>>  wrote:
> 
>>> This patch extends the memset expansion to start with a loop, so as to
>>> still take advantage of known alignment even with long lengths, but
>>> without necessarily adding store blocks for every power of two.
> 
>> I wonder if that isn't better handled by targets via the setmem pattern,
> 
> That was indeed where I started, but then I found myself duplicating the
> logic in try_store_by_multiple_pieces on a per-target basis.
> 
> Target-specific code is great for tight optimizations, but the main
> purpose of this feature is not an optimization.  AFAICT it actually
> slows things down in general (due to code growth, and to conservative
> assumptions about alignment), 

I thought machinery like the memcpy patterns have as one of their benefits the 
ability to find the alignment of their operands and from that optimize things.  
So I don't understand why you'd say "conservative".

paul




Re: [RFC] Introduce -finline-memset-loops

2023-01-13 Thread Alexandre Oliva via Gcc-patches
Hello, Richard,

Thank you for the feedback.

On Jan 12, 2023, Richard Biener  wrote:

> On Tue, Dec 27, 2022 at 5:12 AM Alexandre Oliva via Gcc-patches
>  wrote:

>> This patch extends the memset expansion to start with a loop, so as to
>> still take advantage of known alignment even with long lengths, but
>> without necessarily adding store blocks for every power of two.

> I wonder if that isn't better handled by targets via the setmem pattern,

That was indeed where I started, but then I found myself duplicating the
logic in try_store_by_multiple_pieces on a per-target basis.

Target-specific code is great for tight optimizations, but the main
purpose of this feature is not an optimization.  AFAICT it actually
slows things down in general (due to code growth, and to conservative
assumptions about alignment), except perhaps for some microbenchmarks.
It's rather a means to avoid depending on the C runtime, particularly
due to compiler-introduced memset calls.

My initial goal was to be able to show that inline expansion would NOT
bring about performance improvements, but performance was not the
concern that led to the request.

If the approach seems generally acceptable, I may even end up extending
it to other such builtins.  I have a vague recollection that memcmp is
also an issue for us.

> like x86 has the stringop inline strategy.  What is considered acceptable
> in terms of size or performance will vary and I don't think there's much
> room for improvements on this generic code support?

*nod* x86 is quite finely tuned already; I suppose other targets may
have some room for additional tuning, both for performance and for code
size, but we don't have much affordance for avoiding builtin calls to
the C runtime, which is what this is about.

Sometimes disabling loop distribution is enough to accomplish that, but
in some cases GNAT itself resorts to builtin memset calls, in ways that
are not so easy to avoid, and that would ultimately amount to expanding
memset inline, so I figured we might as well offer that as a general
feature, for users to whom this matters.
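
A C analogue of such a compiler-introduced call, as a made-up illustration:
zero-initializing a large aggregate is typically lowered to a memset even
though no call appears in the source:

  struct big { char buf[4096]; };
  void init (struct big *p) { *p = (struct big) {0}; }  /* ~ memset (p, 0, 4096) */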

Is (optionally) tending to this (uncommon, I suppose) need (or
preference?) not something GCC would like to do?

-- 
Alexandre Oliva, happy hacker                https://FSFLA.org/blogs/lxo/
   Free Software Activist   GNU Toolchain Engineer
Disinformation flourishes because many people care deeply about injustice
but very few check the facts.  Ask me about 


Re: [RFC] Introduce -finline-memset-loops

2023-01-12 Thread Richard Biener via Gcc-patches
On Tue, Dec 27, 2022 at 5:12 AM Alexandre Oliva via Gcc-patches
 wrote:
>
>
> try_store_by_multiple_pieces was added not long ago, enabling
> variable-sized memset to be expanded inline when the worst-case
> in-range constant length would, using conditional blocks with powers
> of two to cover all possibilities of length and alignment.
>
> This patch extends the memset expansion to start with a loop, so as to
> still take advantage of known alignment even with long lengths, but
> without necessarily adding store blocks for every power of two.
>
> This makes it possible for any memset call to be expanded, even if
> storing a single byte per iteration.  Surely efficient implementations
> of memset can do better, with a pre-loop to increase alignment, but
> that would likely be excessive for inline expansions of memset.
>
> Still, in some cases, users prefer to inline memset, even if it's not
> as performant, or when it's known to be performant in ways the
> compiler can't tell, to avoid depending on a C runtime library.
>
> With this flag, global or per-function optimizations may enable inline
> expansion of memset to be selectively requested, while the
> infrastructure for that may enable us to introduce per-target tuning
> to enable such looping where it is advantageous, even if not explicitly
> requested.
>
>
> I realize this is late for new features in this cycle; I'd be happy to
> submit it again later, but I wonder whether there's any interest in this
> feature, or any objections to it.  FWIW, I've regstrapped this on
> x86_64-linux-gnu, and also tested earlier versions of this patch in
> earlier GCC branches with RISC-V crosses.  Is this ok for GCC 14?  Maybe
> even simple enough for GCC 13, considering it's disabled by default?

I wonder if that isn't better handled by targets via the setmem pattern,
like x86 has the stringop inline strategy.  What is considered acceptable
in terms of size or performance will vary and I don't think there's much
room for improvements on this generic code support?

Richard.

> TIA,
>
>
> for  gcc/ChangeLog
>
> * builtins.cc (try_store_by_multiple_pieces): Support starting
> with a loop.
> * common.opt (finline-memset-loops): New.
> * doc/invoke.texi (-finline-memset-loops): Add.
>
> for  gcc/testsuite/ChangeLog
>
> * gcc.dg/torture/inline-mem-set-1.c: New.
> ---
>  gcc/builtins.cc                                 |   50 ++-
>  gcc/common.opt                                  |    4 ++
>  gcc/doc/invoke.texi                             |   13 ++
>  gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c |   14 ++
>  4 files changed, 77 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c
>
> diff --git a/gcc/builtins.cc b/gcc/builtins.cc
> index 02c4fefa86f48..388bae58ce49e 100644
> --- a/gcc/builtins.cc
> +++ b/gcc/builtins.cc
> @@ -4361,9 +4361,37 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>if (max_bits >= 0)
>  xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
> - (HOST_WIDE_INT_1U << ctz_len));
> +  bool max_loop = false;
>if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
> , align, true))
> -return false;
> +{
> +  if (!flag_inline_memset_loops)
> +   return false;
> +  while (--max_bits >= sctz_len)
> +   {
> + xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2
> +- (HOST_WIDE_INT_1U << ctz_len));
> + if (can_store_by_pieces (xlenest + blksize,
> +  builtin_memset_read_str,
> +  , align, true))
> +   {
> + max_loop = true;
> + break;
> +   }
> + if (!blksize)
> +   continue;
> + if (can_store_by_pieces (xlenest,
> +  builtin_memset_read_str,
> +  , align, true))
> +   {
> + blksize = 0;
> + max_loop = true;
> + break;
> +   }
> +   }
> +  if (!max_loop)
> +   return false;
> +}
>
>by_pieces_constfn constfun;
>void *constfundata;
> @@ -4405,6 +4433,7 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>   the least significant bit possibly set in the length.  */
>for (int i = max_bits; i >= sctz_len; i--)
>  {
> +  rtx_code_label *loop_label = NULL;
>rtx_code_label *label = NULL;
>blksize = HOST_WIDE_INT_1U << i;
>
> @@ -4423,14 +4452,24 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
>else if ((max_len & blksize) == 0)
> continue;
>
> +  if (max_loop && i == max_bits)
> +   {
> + loop_label = gen_label_rtx ();
> + emit_label (loop_label);
> + /* Since we may run this multiple times, don't assume we
> +know 

[RFC] Introduce -finline-memset-loops

2022-12-26 Thread Alexandre Oliva via Gcc-patches


try_store_by_multiple_pieces was added not long ago, enabling
variable-sized memset to be expanded inline when the worst-case
in-range constant length would, using conditional blocks with powers
of two to cover all possibilities of length and alignment.

This patch extends the memset expansion to start with a loop, so as to
still take advantage of known alignment even with long lengths, but
without necessarily adding store blocks for every power of two.

This makes it possible for any memset call to be expanded, even if
storing a single byte per iteration.  Surely efficient implementations
of memset can do better, with a pre-loop to increase alignment, but
that would likely be excessive for inline expansions of memset.

Still, in some cases, users prefer to inline memset, even if it's not
as performant, or when it's known to be performant in ways the
compiler can't tell, to avoid depending on a C runtime library.

With this flag, global or per-function optimizations may enable inline
expansion of memset to be selectively requested, while the
infrastructure for that may enable us to introduce per-target tuning
to enable such looping where it is advantageous, even if not explicitly
requested.
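
For example (hypothetical file and function names), building a freestanding
unit with the new flag:

  /* wipe.c */
  void wipe (char *p, unsigned long n) { __builtin_memset (p, 0, n); }

  gcc -O2 -ffreestanding -finline-memset-loops -c wipe.c

is expected to expand the memset inline, starting with a loop, instead of
emitting a library call.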


I realize this is late for new features in this cycle; I'd be happy to
submit it again later, but I wonder whether there's any interest in this
feature, or any objections to it.  FWIW, I've regstrapped this on
x86_64-linux-gnu, and also tested earlier versions of this patch in
earlier GCC branches with RISC-V crosses.  Is this ok for GCC 14?  Maybe
even simple enough for GCC 13, considering it's disabled by default?
TIA,


for  gcc/ChangeLog

* builtins.cc (try_store_by_multiple_pieces): Support starting
with a loop.
* common.opt (finline-memset-loops): New.
* doc/invoke.texi (-finline-memset-loops): Add.

for  gcc/testsuite/ChangeLog

* gcc.dg/torture/inline-mem-set-1.c: New.
---
 gcc/builtins.cc                                 |   50 ++-
 gcc/common.opt                                  |    4 ++
 gcc/doc/invoke.texi                             |   13 ++
 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c |   14 ++
 4 files changed, 77 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/inline-mem-set-1.c

diff --git a/gcc/builtins.cc b/gcc/builtins.cc
index 02c4fefa86f48..388bae58ce49e 100644
--- a/gcc/builtins.cc
+++ b/gcc/builtins.cc
@@ -4361,9 +4361,37 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
   if (max_bits >= 0)
 xlenest += ((HOST_WIDE_INT_1U << max_bits) * 2
- (HOST_WIDE_INT_1U << ctz_len));
+  bool max_loop = false;
   if (!can_store_by_pieces (xlenest, builtin_memset_read_str,
, align, true))
-return false;
+{
+  if (!flag_inline_memset_loops)
+   return false;
+  while (--max_bits >= sctz_len)
+   {
+ xlenest = ((HOST_WIDE_INT_1U << max_bits) * 2
+- (HOST_WIDE_INT_1U << ctz_len));
+ if (can_store_by_pieces (xlenest + blksize,
+  builtin_memset_read_str,
+  , align, true))
+   {
+ max_loop = true;
+ break;
+   }
+ if (!blksize)
+   continue;
+ if (can_store_by_pieces (xlenest,
+  builtin_memset_read_str,
+  , align, true))
+   {
+ blksize = 0;
+ max_loop = true;
+ break;
+   }
+   }
+  if (!max_loop)
+   return false;
+}
 
   by_pieces_constfn constfun;
   void *constfundata;
@@ -4405,6 +4433,7 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
  the least significant bit possibly set in the length.  */
   for (int i = max_bits; i >= sctz_len; i--)
 {
+  rtx_code_label *loop_label = NULL;
   rtx_code_label *label = NULL;
   blksize = HOST_WIDE_INT_1U << i;
 
@@ -4423,14 +4452,24 @@ try_store_by_multiple_pieces (rtx to, rtx len, unsigned int ctz_len,
   else if ((max_len & blksize) == 0)
continue;
 
+  if (max_loop && i == max_bits)
+   {
+ loop_label = gen_label_rtx ();
+ emit_label (loop_label);
+ /* Since we may run this multiple times, don't assume we
+know anything about the offset.  */
+ clear_mem_offset (to);
+   }
+
   /* Issue a store of BLKSIZE bytes.  */
+  bool update_needed = i != sctz_len || loop_label;
   to = store_by_pieces (to, blksize,
constfun, constfundata,
align, true,
-   i != sctz_len ? RETURN_END : RETURN_BEGIN);
+   update_needed ? RETURN_END : RETURN_BEGIN);
 
   /* Adjust REM and PTR, unless this is the last iteration.  */
-  if (i != sctz_len)
+  if (update_needed)