Re: [libc-coord] Add new ABI '__memcmpeq()' to libc

2021-09-22 Thread Christoph Müllner via Gcc
Would it make sense to extend this proposal to include __strcmpeq()
and __strncmpeq()?

Both are already available internally in GCC in the form of
BUILT_IN_STRCMP_EQ and BUILT_IN_STRNCMP_EQ
(tree-ssa-strlen.c detects them in handle_builtin_string_cmp() and
builtins.c tries to inline them in expand_builtin_memcmp()).
However, they are currently restricted to cases where the length of
the string or the size of the array (of both arguments) is known.

A use case for __strcmpeq() would be the comparison of std::type_info
objects (the equality and inequality operators) in libstdc++.
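
For clarity, here is a rough sketch of the declarations under discussion
(not final glibc prototypes; __strcmpeq()/__strncmpeq() are only proposed
here, and their exact signatures are my assumption). The contract mirrors
__memcmpeq: return zero iff the operands compare equal, and a nonzero
return value carries no ordering information.

#include <stddef.h>

/* Equality-only variants: zero iff equal, otherwise an unspecified
   nonzero value (no ordering implied).  */
int __memcmpeq (const void *s1, const void *s2, size_t n);
int __strcmpeq (const char *s1, const char *s2);             /* proposed */
int __strncmpeq (const char *s1, const char *s2, size_t n);  /* proposed */

In the libstdc++ case mentioned above, type_info's operator== and operator!=
only need to know whether the mangled names are equal, so an equality-only
string comparison would be sufficient.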

On Tue, Sep 21, 2021 at 9:54 PM Noah Goldstein via Gcc  wrote:
>
> On Fri, Sep 17, 2021 at 9:27 AM Florian Weimer via Libc-alpha <
> libc-al...@sourceware.org> wrote:
>
> > * Joseph Myers:
> >
> > > I was supposing a build-time decision (using GCC_GLIBC_VERSION_GTE_IFELSE
> > > to know if the glibc version on the target definitely has this function).
> > > But if we add a header declaration, you could check for __memcmpeq being
> > > declared (and so cover arbitrary C libraries, not just glibc, and avoid
> > > issues of needing to disable this logic for freestanding compilations,
> > > which would otherwise be an issue if a glibc-target toolchain is used for
> > > a freestanding kernel compilation).  The case of people calling
> > > __builtin_memcmp (or declaring memcmp themselves) without string.h
> > > included probably isn't one it's important to optimize.
> >
> > The header-less case looks relevant to C++ and other language front
> > ends, though.  So a GCC_GLIBC_VERSION_GTE_IFELSE check could still make
> > sense for them.
> >
> > (Dropping libc-coord.)
> >
> > Thanks,
> > Florian
> >
> >
> What are we going with?
>
> Should I go forward with the proposal in glibc?
>
> If I do go forward with it, should I include a declaration in string.h?


Re: Priority of builtins expansion strategies

2021-07-13 Thread Christoph Müllner via Gcc
On Tue, Jul 13, 2021 at 2:59 PM Richard Biener
 wrote:
>
> On Tue, Jul 13, 2021 at 2:19 PM Christoph Müllner via Gcc
>  wrote:
> >
> > On Tue, Jul 13, 2021 at 2:11 AM Alexandre Oliva  wrote:
> > >
> > > On Jul 12, 2021, Christoph Müllner  wrote:
> > >
> > > > * Why does the generic by-pieces infrastructure have a higher priority
> > > > than the target-specific expansion via INSNs like setmem?
> > >
> > > by-pieces was not affected by the recent change, and IMHO it generally
> > > makes sense for it to have priority over setmem.  It generates only
> > > straight-line code for constant-sized blocks.  Even if you can beat that
> > > with some machine-specific logic, you'll probably end up generating
> > > equivalent code at least in some cases, and then, you probably want to
> > > carefully tune the settings that select one or the other, or disable
> > > by-pieces altogether.
> > >
> > >
> > > by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
> > > looping constructs, if any are available, so setmem takes precedence.
> > >
> > > My testing involved bringing it ahead of the insns, to exercise the code
> > > more thoroughly even on x86*, but the submitted patch only used
> > > by-multiple-pieces as a fallback.
> >
> > Let me give you an example of what by-pieces does on RISC-V (RV64GC).
> > The following code...
> >
> > void* do_memset0_8 (void *p)
> > {
> > return memset (p, 0, 8);
> > }
> >
> > void* do_memset0_15 (void *p)
> > {
> > return memset (p, 0, 15);
> > }
> >
> > ...becomes (you can validate that with compiler explorer):
> >
> > do_memset0_8(void*):
> > sb  zero,0(a0)
> > sb  zero,1(a0)
> > sb  zero,2(a0)
> > sb  zero,3(a0)
> > sb  zero,4(a0)
> > sb  zero,5(a0)
> > sb  zero,6(a0)
> > sb  zero,7(a0)
> > ret
> > do_memset0_15(void*):
> > sb  zero,0(a0)
> > sb  zero,1(a0)
> > sb  zero,2(a0)
> > sb  zero,3(a0)
> > sb  zero,4(a0)
> > sb  zero,5(a0)
> > sb  zero,6(a0)
> > sb  zero,7(a0)
> > sb  zero,8(a0)
> > sb  zero,9(a0)
> > sb  zero,10(a0)
> > sb  zero,11(a0)
> > sb  zero,12(a0)
> > sb  zero,13(a0)
> > sb  zero,14(a0)
> > ret
> >
> > Here is what a setmemsi expansion in the backend can do (in case
> > unaligned access is cheap):
> >
> > 003c <do_memset0_8>:
> >   3c:   00053023                sd      zero,0(a0)
> >   40:   8082                    ret
> >
> > 007e <do_memset0_15>:
> >   7e:   00053023                sd      zero,0(a0)
> >   82:   000533a3                sd      zero,7(a0)
> >   86:   8082                    ret
> >
> > Is there a way to generate similar code with the by-pieces infrastructure?
>
> Sure - tell it unaligned access is cheap.  See alignment_for_piecewise_move
> and how it uses slow_unaligned_access.

Thanks for the pointer.
I already knew about slow_unaligned_access, but I was not aware of
overlap_op_by_pieces_p.
Enabling both gives exactly the same code as shown above.
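
For reference, a minimal sketch of what such a backend change could look
like (hypothetical code, not the actual riscv.c implementation; only the
hook names and signatures are taken from the GCC target hook documentation):

/* Hypothetical sketch only.  Report unaligned accesses as cheap and allow
   overlapping stores, so that by-pieces can emit the two overlapping
   8-byte sd stores shown above.  */
static bool
my_slow_unaligned_access (machine_mode mode ATTRIBUTE_UNUSED,
                          unsigned int align ATTRIBUTE_UNUSED)
{
  /* Assume the tuning model says misaligned accesses are fast.  */
  return false;
}

static bool
my_overlap_op_by_pieces (void)
{
  /* Permit by-pieces to emit overlapping pieces, e.g. 0(a0) and 7(a0).  */
  return true;
}

#undef TARGET_SLOW_UNALIGNED_ACCESS
#define TARGET_SLOW_UNALIGNED_ACCESS my_slow_unaligned_access
#undef TARGET_OVERLAP_OP_BY_PIECES_P
#define TARGET_OVERLAP_OP_BY_PIECES_P my_overlap_op_by_pieces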

Thanks,
Christoph

> > > > * And if there are no particular reasons, would it be acceptable to
> > > > change the order?
> > >
> > > I suppose moving insns ahead of by-pieces might break careful tuning of
> > > multiple platforms, so I'd rather we did not make that change.
> >
> > Only platforms that have "setmemsi" implemented would be affected.
> > And those platforms (arm, frv, ft32, nds32, pa, rs6000, rx, visium)
> > have a carefully tuned implementation of the setmem expansion.
> > I can't imagine that these setmem expansions produce worse code than
> > the by-pieces infrastructure (which has less knowledge about the target).
> >
> > Thanks,
> > Christoph


Re: Priority of builtins expansion strategies

2021-07-13 Thread Christoph Müllner via Gcc
On Tue, Jul 13, 2021 at 2:11 AM Alexandre Oliva  wrote:
>
> On Jul 12, 2021, Christoph Müllner  wrote:
>
> > * Why does the generic by-pieces infrastructure have a higher priority
> > than the target-specific expansion via INSNs like setmem?
>
> by-pieces was not affected by the recent change, and IMHO it generally
> makes sense for it to have priority over setmem.  It generates only
> straight-line code for constant-sized blocks.  Even if you can beat that
> with some machine-specific logic, you'll probably end up generating
> equivalent code at least in some cases, and then, you probably want to
> carefully tune the settings that select one or the other, or disable
> by-pieces altogether.
>
>
> by-multiple-pieces, OTOH, is likely to be beaten by machine-specific
> looping constructs, if any are available, so setmem takes precedence.
>
> My testing involved bringing it ahead of the insns, to exercise the code
> more thoroughly even on x86*, but the submitted patch only used
> by-multiple-pieces as a fallback.

Let me give you an example of what by-pieces does on RISC-V (RV64GC).
The following code...

void* do_memset0_8 (void *p)
{
return memset (p, 0, 8);
}

void* do_memset0_15 (void *p)
{
return memset (p, 0, 15);
}

...becomes (you can validate that with compiler explorer):

do_memset0_8(void*):
sb  zero,0(a0)
sb  zero,1(a0)
sb  zero,2(a0)
sb  zero,3(a0)
sb  zero,4(a0)
sb  zero,5(a0)
sb  zero,6(a0)
sb  zero,7(a0)
ret
do_memset0_15(void*):
sb  zero,0(a0)
sb  zero,1(a0)
sb  zero,2(a0)
sb  zero,3(a0)
sb  zero,4(a0)
sb  zero,5(a0)
sb  zero,6(a0)
sb  zero,7(a0)
sb  zero,8(a0)
sb  zero,9(a0)
sb  zero,10(a0)
sb  zero,11(a0)
sb  zero,12(a0)
sb  zero,13(a0)
sb  zero,14(a0)
ret

Here is what a setmemsi expansion in the backend can do (in case
unaligned access is cheap):

003c <do_memset0_8>:
  3c:   00053023                sd      zero,0(a0)
  40:   8082                    ret

007e <do_memset0_15>:
  7e:   00053023                sd      zero,0(a0)
  82:   000533a3                sd      zero,7(a0)
  86:   8082                    ret

Is there a way to generate similar code with the by-pieces infrastructure?

> > * And if there are no particular reasons, would it be acceptable to
> > change the order?
>
> I suppose moving insns ahead of by-pieces might break careful tuning of
> multiple platforms, so I'd rather we did not make that change.

Only platforms that have "setmemsi" implemented would be affected.
And those platforms (arm, frv, ft32, nds32, pa, rs6000, rx, visium)
have a carefully tuned implementation of the setmem expansion.
I can't imagine that these setmem expansions produce worse code than
the by-pieces infrastructure (which has less knowledge about the target).

Thanks,
Christoph


Priority of builtins expansion strategies

2021-07-12 Thread Christoph Müllner via Gcc
Hi,

I'm working on some platform-specific optimizations for
memset/memcpy/strcpy/strncpy.
However, I am having difficulties understanding how my code should be
integrated.
Initially, I got inspired by rs6000-string.c, where I see expansion code
for instructions like setmemsi or cmpstrsi. However, that expansion code
is not always called. Instead, the first strategy tried is the generic
by-pieces infrastructure.

To understand what I mean, let's have a look at memset
(expand_builtin_memset_args).
The backend can provide a tailored code sequence by expanding setmem.
However, there is also a generic solution available using the
by-pieces infrastructure.
The generic by-pieces infrastructure has a higher priority than the
target-specific setmem expansion. However, the recently added
by-multiple-pieces infrastructure has a lower priority than setmem.

See:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/builtins.c;h=39ab139b7e1c06c98d2db1aef2b3a6095ffbec63;hb=HEAD#l7004
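
To make the order concrete, here is a heavily simplified sketch of the
dispatch I am describing (my reading of the code behind the link above;
the helper names below are placeholders, not the actual builtins.c
functions):

/* Simplified illustration of the current priority -- placeholder names.  */
static rtx
expand_memset_like_builtin (rtx dest, rtx val, rtx len, unsigned int align)
{
  /* 1. Generic straight-line by-pieces expansion (highest priority).  */
  if (generic_by_pieces_possible_p (len, align))
    return expand_by_pieces (dest, val, len, align);

  /* 2. Target-specific setmem insn pattern, if the backend provides one.  */
  if (expand_target_setmem_pattern (dest, val, len, align))
    return dest;

  /* 3. Recently added generic by-multiple-pieces (looping) expansion.  */
  if (try_by_multiple_pieces (dest, val, len, align))
    return dest;

  /* 4. Otherwise fall back to a library call.  */
  return expand_memset_libcall (dest, val, len);
}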

The same observation is true for most (all?) other uses of builtins.

The current priority requires me to duplicate the condition that decides
whether my optimization can be applied in the following two places
(a rough sketch follows the list):
1) in TARGET_USE_BY_PIECES_INFRASTRUCTURE_P () to block by-pieces
2) in the setmem expansion to gate the optimization
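
A hedged sketch of that duplication (the backend and the condition are
invented for illustration; only the hook name and signature come from the
GCC internals documentation):

/* The condition "my setmem expansion can handle this" -- placeholder.  */
static bool
my_setmem_expansion_applies_p (unsigned HOST_WIDE_INT size, unsigned int align)
{
  return size <= 64;   /* invented example condition */
}

/* Place 1: block by-pieces so the setmem pattern is even consulted.  */
static bool
my_use_by_pieces_infrastructure_p (unsigned HOST_WIDE_INT size,
                                   unsigned int align,
                                   enum by_pieces_operation op,
                                   bool speed_p)
{
  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
      && my_setmem_expansion_applies_p (size, align))
    return false;
  return default_use_by_pieces_infrastructure_p (size, align, op, speed_p);
}

#undef TARGET_USE_BY_PIECES_INFRASTRUCTURE_P
#define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P my_use_by_pieces_infrastructure_p

/* Place 2: the "setmemsi" expander has to repeat the same check, e.g.
   FAIL when !my_setmem_expansion_applies_p (size, align).  */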

As I would expect a target-specific mechanism to be preferred over a
generic one, my questions are:
* Why does the generic by-pieces infrastructure have a higher priority
than the target-specific expansion via INSNs like setmem?
* And if there are no particular reasons, would it be acceptable to
change the order?

Thanks,
Christoph


Re: LTO Dead Field Elimination

2020-07-27 Thread Christoph Müllner
Hi Richard,

On 7/27/20 2:36 PM, Richard Biener wrote:
> On Fri, Jul 24, 2020 at 5:43 PM Erick Ochoa
>  wrote:
>>
>> This patchset brings back struct reorg to GCC.
>>
>> We’ve been working on improving cache utilization recently and would
>> like to share our current implementation to receive some feedback on it.
>>
>> Essentially, we’ve implemented the following components:
>>
>> * Type-based escape analysis to determine if we can reorganize a type
>> at link-time
>>
>> * Dead-field elimination to remove unused fields of a struct at
>> link-time
>>
>> The type-based escape analysis provides a list of types that are not
>> visible outside of the current linking unit (e.g. parameter types of
>> external functions).
>>
>> The dead-field elimination pass analyses non-escaping structs for fields
>> that are not used in the linking unit and thus can be removed. The
>> resulting struct has a smaller memory footprint, which allows for a
>> higher cache utilization.
>>
>> As a side effect, some new infrastructure code has been written
>> (e.g. a type walker, which we were really missing in GCC), which can
>> of course be reused for other passes as well.
>>
>> We’ve prepared a patchset in the following branch:
>>
>>refs/vendors/ARM/heads/arm-struct-reorg-wip
> 
> Just had some time to peek into this.  Ugh.  The code doesn't look like
> GCC code looks :/  It doesn't help to have one set of files per C++ class 
> (25!).

Any suggestions on how to best structure these?
Are there coding guidelines in the GCC project
that can help us match the expectations?

> The code itself is undocumented - it's hard to understand what the purpose
> of all the Walker stuff is.
> 
> You also didn't seem to know walk_tree () nor walk_gimple* ().

True, we were not aware of that code.
Thanks for pointing it out; we will have a look.
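
For our own reference, a minimal sketch of how walk_tree () could replace a
hand-written walker (our understanding of the API; untested):

/* Callback for walk_tree: count field accesses (COMPONENT_REFs).  */
static tree
count_component_refs_r (tree *tp, int *walk_subtrees ATTRIBUTE_UNUSED,
                        void *data)
{
  if (TREE_CODE (*tp) == COMPONENT_REF)
    (*(unsigned *) data)++;
  return NULL_TREE;   /* NULL_TREE means: keep walking.  */
}

/* E.g. count the field accesses in a given expression EXPR.  */
static unsigned
count_field_accesses_in (tree expr)
{
  unsigned n = 0;
  walk_tree (&expr, count_component_refs_r, &n, NULL);
  return n;
}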

> Take as example - I figured to look for IPA pass entries, then I see
> 
> +
> +static void
> +collect_types ()
> +{
> +  GimpleTypeCollector collector;
> +  collector.walk ();
> +  collector.print_collected ();
> +  ptrset_t types = collector.get_pointer_set ();
> +  GimpleCaster caster (types);
> +  caster.walk ();
> +  if (flag_print_cast_analysis)
> +    caster.print_reasons ();
> +  ptrset_t casting = caster.get_sets ();
> +  fix_escaping_types_in_set (casting);
> +  GimpleAccesser accesser;
> +  accesser.walk ();
> +  if (flag_print_access_analysis)
> +    accesser.print_accesses ();
> +  record_field_map_t record_field_map = accesser.get_map ();
> +  TypeIncompleteEquality equality;
> +  bool has_fields_that_can_be_deleted = false;
> +  typedef std::set field_offsets_t;
> 
> there are no comments (not even file-level) that explain how type escape
> is computed.
> 
> Sorry, but this isn't even close to be coarsely reviewable.

Sad to hear.
We'll work on the feedback you provided and submit a new version.

Thanks,
Christoph

> 
>> We’ve also added a subsection in the GCC internals document to allow
>> other compiler devs to better understand our design and implementation.
>> A generated PDF can be found here:
>>
>> https://cloud.theobroma-systems.com/s/aWwxPiDJ3nCgc7F
>> https://cloud.theobroma-systems.com/s/aWwxPiDJ3nCgc7F/download
>>
>> page: 719
>>
>> We’ve been testing the pass against a range of in-tree tests and
>> real-life applications (e.g. all SPEC CPU2017 C benchmarks). For
>> testing, please see testing subsection in the gcc internals we prepared.
>>
>> Currently we see the following limitations:
>>
>> * It is not a "true" IPA pass yet. That is, we can only succeed with
>> -flto-partition=none.
>> * Currently it is not safe to use -fipa-sra.
>> * Brace constructors are not supported yet. We handle this gracefully.
>> * Only C as of now.
>> * Results of sizeof() and offsetof() are generated in the compiler
>> frontend and thus can’t be changed later at link time. There are a
>> couple of ideas to resolve this, but that’s currently unimplemented.
>> * At this point we’d like to thank the GCC community for their patient
>> help so far on the mailing list and in other channels. And we ask for
>> your support in terms of feedback, comments and testing.
>>
>> Thanks!
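
To illustrate the dead-field elimination described at the top of this
thread, here is a minimal before/after sketch (invented struct, purely
illustrative, not taken from the patchset):

struct particle {                /* original layout: 32 bytes on LP64 */
  double x, y;
  long scratch;                  /* never read or written in the link unit */
  int id;
};

/* After dead-field elimination the layout conceptually shrinks to
   24 bytes, so more elements fit per cache line when iterating large
   arrays of this type.  */
struct particle_reorg {
  double x, y;
  int id;
};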