[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-28 Thread vital.had at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #22 from Sergey Fedorov  ---
(In reply to Iain Sandoe from comment #19)
> Created attachment 53779 [details]
> introduce ppc_intrinsics.h for powerpc*-darwin.
> 
> This takes the header from the GCC-4.x apple debt branch (as present in SVN:
> r113478) and 
>  - updates the license.
>  - installs for powerpc*-darwin
> 
> It needs the test cases forward porting too.
> However, it would be good to know if this solves the problems folks have
> encountered here (if other ports want to try it, they only need to amend
> their entry in gcc/config.gcc)

Thank you! I will try it.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-26 Thread iains at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #21 from Iain Sandoe  ---
(In reply to Segher Boessenkool from comment #18)
> (In reply to Sergey Fedorov from comment #16)
> > For Darwin, the PPC intrinsics are already there in the Apple headers. Can
> > they be added to current GCC?
> 
> If it is in the Apple headers already, why would you need a separate copy
> in GCC?

It's an internal header in apple-gcc-4.x, so it is not accessible to end users
unless they use those compilers (nor usable like  by GCC for example).

Some projects appear to depend on it (whether broken or not)...
.. but I'd welcome some persuasive evidence that it does make things better.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-26 Thread iains at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #20 from Iain Sandoe  ---
the patch above does not seek to answer questions on validity - it simply
publishes the same header that was made available in the darwin toolchains (so
will be neither better nor worse than that).

[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-26 Thread iains at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #19 from Iain Sandoe  ---
Created attachment 53779
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53779&action=edit
introduce ppc_intrinsics.h for powerpc*-darwin.

This takes the header from the GCC-4.x apple debt branch (as present in SVN:
r113478) and 
 - updates the license.
 - installs for powerpc*-darwin

It needs the test cases forward porting too.
However, it would be good to know if this solves the problems folks have
encountered here (if other ports want to try it, they only need to amend their
entry in gcc/config.gcc)

[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-26 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #18 from Segher Boessenkool  ---
(In reply to Sergey Fedorov from comment #16)
> For Darwin, the PPC intrinsics are already there in the Apple headers. Can
> they be added to current GCC?

If it is in the Apple headers already, why would you need a separate copy
in GCC?

Also please note that as said many of those things do not work with current
GCC, and arguably didn't work with older GCC either, the user just got lucky
that the random translation he got did what he wanted :-/  Things like "sync"
or "dcbst" need proper dependencies, things like lwarx are *impossible* to
do, etc.

But thought-out patches are welcome :-)

[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-26 Thread iains at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #17 from Iain Sandoe  ---
(In reply to Sergey Fedorov from comment #16)
> For Darwin, the PPC intrinsics are already there in the Apple headers. Can
> they be added to current GCC?

There is a version of the header on the FSF Apple branch, which means we can
forward port / check / add to the installation / test it.  It might/might not
be useful to any other ppc sub-ports.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-26 Thread vital.had at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #16 from Sergey Fedorov  ---
For Darwin, the PPC intrinsics are already there in the Apple headers. Can they
be added to current GCC?

[Bug target/93177] PPC: Missing many useful platform intrinsics

2022-10-04 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vital.had at gmail dot com

--- Comment #15 from Andrew Pinski  ---
*** Bug 107155 has been marked as a duplicate of this bug. ***

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-24 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #14 from Matt Emmerton  ---
I'd like to thank everyone for the great discussion so far.
Here's a summary of where we are at this point.

1) sync intrinsics

Useful, but with caveats.

2) cache prefetch intrinsics

Implemented via __builtin_prefetch()

3) larx/stcx intrinsics

Useful, but with caveats.

Improvements to stcx CR handling, see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93417

4) streaming cache prefetch intrinsics

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93408

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-23 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #13 from Segher Boessenkool  ---
You cannot use that from intrinsics.  But the target code can do similar of
course, whether or not asm syntax for this exists.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-23 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #12 from Andrew Pinski  ---
If the PowerPC back end supported the "Flag Output Operands" part of GCC's
inline asm, you could use that to do the correct thing.  Sadly, PowerPC does
not currently support it.

https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Flag-Output-Operands
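
For readers unfamiliar with the feature, here is a minimal sketch of what a flag output operand looks like. It is shown on x86-64 only because, as stated above, the PowerPC back end did not support the feature at the time of this comment. The helper name bit_is_set is hypothetical, and the plain C fallback is an assumption added so the sketch builds on other targets.

```c
/* Hypothetical helper: returns 1 if bit number `bit` of `value` is set.
   `bit` is assumed to be smaller than the width of unsigned long. */
static inline int bit_is_set(unsigned long value, unsigned long bit)
{
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(__x86_64__)
    int carry;
    /* "=@ccc" asks the compiler to read the carry flag directly as the
       result, so no setcc/mov sequence has to appear inside the asm. */
    __asm__("bt %2, %1" : "=@ccc"(carry) : "r"(value), "r"(bit));
    return carry;
#else
    /* Fallback for targets without flag output operands. */
    return (int)((value >> bit) & 1UL);
#endif
}
```

The point of the feature is that the compiler can branch on the condition flag itself instead of first materialising it in a general-purpose register.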

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-23 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #11 from Matt Emmerton  ---
> > > > The implementation of stwcx() and stdcx() needs revision on PPC.
> > > > As I understand it, there is no need for the mfocrf instruction or
> > > > the mask-and-shift on the result.
> > > 
> > > How else would you output the CR0.EQ bit?
> > 
> > There is no need to copy CR0 to a GPR - branch instructions such as BNE can
> > operate on CR0 directly.
> 
> You cannot write anything that maps to a CR field directly.

No need to access it directly - just use a BNE instruction (to branch for
retry/success) which operates implicitly on CR0.EQ.

There is plenty of material out there that implements atomic operations on
POWER like this:

loop:
    lwarx
    // do something
    stwcx.
    bne loop

gcc does an unnecessary mfocrf + cmp to achieve the same result.

Is there an assumption in gcc that the "result" of any intrinsic is reported in
a GPR, which disallows this implicit use of CR0?

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-23 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #10 from Segher Boessenkool  ---
(In reply to Matt Emmerton from comment #9)
> > > __sync()
> > > __isync()
> > > __lwsync()
> > 
> > The sync intrinsics need to be tied to some other code.  A volatile asm with
> > a "memory" clobber is not good enough, in many cases.
> 
> We use these in our internal mutex and atomic implementations, and the
> resulting sequences are carefully scrutinized.

You have to check it after *every build* then, in general :-/

> > > __lwarx()
> > > __ldarx()
> > > __stwcx()
> > > __stdcx()
> > 
> > The compiler can always insert memory accesses in between those two, if you
> > have them as separate intrinsics (and it will, simply stack accesses for
> > temporaries will do, already).  If those accesses hit the same reservation
> > granule as the larx/stcx. uses, you lose.
> > 
> > You need to write the whole sequence in one piece of assembler code.
> 
> I would argue that the compiler should be smart enough to realize that these
> are part of a decomposed atomic operation, and avoid arbitrary instruction
> injection.

But this is impossible; it is contrary to all optimisation goals we have.  Yes,
it could perhaps work with -O0.

> > > __protected_stream_set()
> > > __protected_stream_count()
> > > __protected_stream_count_depth() // currently not implemented in gcc
> > > __protected_stream_go()
> > 
> > Those are pretty specific to CBE I think?
> 
> No.  They are implemented on POWER5 and above (ISA 2.02), and are useful in
> managing cache prefetch behaviour.

Open a separate feature request for these then, please.

> > > The implementation of stwcx() and stdcx() needs revision on PPC.
> > > As I understand it, there is no need for the mfocrf instruction or the
> > > mask-and-shift on the result.
> > 
> > How else would you output the CR0.EQ bit?
> 
> There is no need to copy CR0 to a GPR - branch instructions such as BNE can
> operate on CR0 directly.

You cannot write anything that maps to a CR field directly.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-13 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #9 from Matt Emmerton  ---
(In reply to Segher Boessenkool from comment #6)
> (In reply to Matt Emmerton from comment #4)
> > The intrinsics that we would find useful, having used them as provided by
> > the IBM XL C/C++ compiler, are the following:
> > 
> > __sync()
> > __isync()
> > __lwsync()
> 
> The sync intrinsics need to be tied to some other code.  A volatile asm with
> a "memory" clobber is not good enough, in many cases.

We use these in our internal mutex and atomic implementations, and the
resulting sequences are carefully scrutinized.

> > __lwarx()
> > __ldarx()
> > __stwcx()
> > __stdcx()
> 
> The compiler can always insert memory accesses in between those two, if you
> have them as separate intrinsics (and it will, simply stack accesses for
> temporaries will do, already).  If those accesses hit the same reservation
> granule as the larx/stcx. uses, you lose.
> 
> You need to write the whole sequence in one piece of assembler code.

I would argue that the compiler should be smart enough to realize that these
are part of a decomposed atomic operation, and avoid arbitrary instruction
injection.

As per my previous update, we use these primitives to implement things that the
builtin __atomic_* functions do not implement.

> > __protected_stream_set()
> > __protected_stream_count()
> > __protected_stream_count_depth() // currently not implemented in gcc
> > __protected_stream_go()
> 
> Those are pretty specific to CBE I think?

No.  They are implemented on POWER5 and above (ISA 2.02), and are useful in
managing cache prefetch behaviour.

> > The implementation of stwcx() and stdcx() needs revision on PPC.
> > As I understand it, there is no need for the mfocrf instruction or the
> > mask-and-shift on the result.
> 
> How else would you output the CR0.EQ bit?

There is no need to copy CR0 to a GPR - branch instructions such as BNE can
operate on CR0 directly.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-13 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #8 from Matt Emmerton  ---
(In reply to Andrew Pinski from comment #5)
> > __lwarx()
> > __ldarx()
> > __stwcx()
> > __stdcx()
> 
> Is there a reason why the __atomic_* builtins don't work?

There are places in our code where we do manipulations of the lockword that
cannot be emulated by the __atomic_* builtins, and thus require us to emit
discrete larx/stcx instructions (with other goodness in between.)

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #7 from Segher Boessenkool  ---
(In reply to Andrew Pinski from comment #5)
> >the cntlz ones are not, for example
> 
> :)  It has been a long time since I touched this but I would not doubt I
> messed up this too.

It's nastiness in the generic builtins.  __builtin_clz(0) is undefined, even
if it *is* defined for the machine pattern.  This is so that code using the
builtin can be portable.  Unfortunately there is no good way (or I don't know
it, anyway) to do something like

  int f(int x) { return x ? __builtin_clz(x) : 32; }

so that it compiles to just a cntlzw insn (instead, it currently does a branch
and stuff :-( ).
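
A minimal, self-contained version of that guarded idiom is sketched below. The name clz32 is hypothetical. Whether a given GCC release folds the test into a single cntlzw is not something this sketch claims; it only shows the portable way to get a defined result for zero.

```c
#include <stdint.h>

/* Count leading zeros of a 32-bit value, returning 32 for an input of 0.
   __builtin_clz(0) is undefined, so the zero case is handled explicitly.
   Assumes unsigned int is 32 bits wide, as on the usual GCC targets. */
static inline int clz32(uint32_t x)
{
    return x ? __builtin_clz(x) : 32;
}
```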

> __mulh* intrinsics are better implemented these days using either 64bit or
> 128bit multiplies.

Yup.
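
As a sketch of what "using a wider multiply" means here: the helper names below are hypothetical and merely echo the instruction mnemonics, and the signed variant relies on GCC's arithmetic right shift of negative values, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* High 32 bits of a signed 32x32 -> 64-bit product. */
static inline int32_t mulhw(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* High 32 bits of an unsigned 32x32 -> 64-bit product. */
static inline uint32_t mulhwu(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

#ifdef __SIZEOF_INT128__
/* High 64 bits of an unsigned 64x64 -> 128-bit product
   (only on targets where GCC provides __int128). */
static inline uint64_t mulhdu(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
#endif
```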

> __l[hwd]brx/__st[hwd]brx intrinsics are better implemented as
> __builtin_bswap* combined with a load/store these days (the bswap builtins
> did not exist back then, or were not optimized)

Yup.
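
A sketch of the bswap-based form for the 32-bit case follows. The function names are hypothetical, and memcpy is used only to keep the access alignment-safe; no claim is made here about which instructions a particular GCC version selects for it.

```c
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value in the opposite of the native byte order. */
static inline uint32_t load_u32_byterev(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);        /* native-order load, alignment-safe */
    return __builtin_bswap32(v);    /* reverse the four bytes */
}

/* Store a 32-bit value in the opposite of the native byte order. */
static inline void store_u32_byterev(void *p, uint32_t v)
{
    v = __builtin_bswap32(v);
    memcpy(p, &v, sizeof v);
}
```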

> Many of the other intrinsics should be implemented without inline asm too;
> even fma should be done using __builtin_fma :).

Yup :-)

GCC has come a long way, since Cell :-)  You can reliably write many things
just as high-level C code now, and trust that well-optimised machine code
falls out.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #6 from Segher Boessenkool  ---
(In reply to Matt Emmerton from comment #4)
> The intrinsics that we would find useful, having used them as provided by
> the IBM XL C/C++ compiler, are the following:
> 
> __sync()
> __isync()
> __lwsync()

The sync intrinsics need to be tied to some other code.  A volatile asm with
a "memory" clobber is not good enough, in many cases.
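
One way to tie the ordering to the access it protects, instead of issuing a free-standing barrier, is to use the __atomic builtins with an explicit memory order. The sketch below is an illustration only: struct mailbox, publish and try_consume are hypothetical names, and the exact barrier instructions GCC emits for release and acquire on a given PowerPC target are not asserted here.

```c
#include <stdint.h>

/* Hypothetical single-slot mailbox: `ready` guards `payload`. */
struct mailbox {
    uint32_t payload;
    int      ready;
};

/* Writer: the release store orders the payload write before the flag. */
static inline void publish(struct mailbox *m, uint32_t value)
{
    m->payload = value;
    __atomic_store_n(&m->ready, 1, __ATOMIC_RELEASE);
}

/* Reader: the acquire load orders the flag read before the payload read.
   Returns 1 and fills *out if a value was published, 0 otherwise. */
static inline int try_consume(struct mailbox *m, uint32_t *out)
{
    if (!__atomic_load_n(&m->ready, __ATOMIC_ACQUIRE))
        return 0;
    *out = m->payload;
    return 1;
}
```

Because the barrier is part of the atomic operation, the compiler knows which access it belongs to and cannot move that access to the wrong side of it.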

> __dcbt()
> __dcbtst()

Those are __builtin_prefetch().
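
For reference, __builtin_prefetch takes an address plus two optional constant arguments: rw (0 for a read, 1 for a write) and locality (0 to 3). A sketch, assuming a read prefetch stands in for __dcbt() and a write prefetch for __dcbtst(); the function name and the look-ahead distance of 16 elements are arbitrary choices for illustration.

```c
#include <stddef.h>

/* Sum an array while prefetching a few elements ahead for reading. */
static long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 3);  /* rw=0: read */
        total += data[i];
    }
    return total;
}
```

Passing 1 as the second argument requests a prefetch in preparation for a write instead.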

> __lwarx()
> __ldarx()
> __stwcx()
> __stdcx()

The compiler can always insert memory accesses in between those two, if you
have them as separate intrinsics (and it will, simply stack accesses for
temporaries will do, already).  If those accesses hit the same reservation
granule as the larx/stcx. uses, you lose.

You need to write the whole sequence in one piece of assembler code.
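
A sketch of what "one piece of assembler code" can look like for a simple fetch-and-add. The function name is hypothetical, the operation has relaxed ordering (no barriers are included), and the non-PowerPC branch is an assumption added only so the example builds elsewhere.

```c
#include <stdint.h>

/* Atomically add v to *p and return the previous value (relaxed order). */
static inline uint32_t atomic_add_u32(uint32_t *p, uint32_t v)
{
#if defined(__powerpc__) || defined(__ppc__) || defined(__PPC__)
    uint32_t old, tmp;
    /* The whole reservation sequence lives in a single asm statement, so
       the compiler cannot place its own memory accesses between the
       lwarx and the stwcx. */
    __asm__ __volatile__(
        "1: lwarx  %0,0,%3\n"   /* old = *p, and take the reservation */
        "   add    %1,%0,%4\n"  /* tmp = old + v                      */
        "   stwcx. %1,0,%3\n"   /* try to store tmp; sets CR0.EQ      */
        "   bne-   1b\n"        /* reservation lost: retry            */
        : "=&r"(old), "=&r"(tmp), "+m"(*p)
        : "r"(p), "r"(v)
        : "cr0");
    return old;
#else
    return __atomic_fetch_add(p, v, __ATOMIC_RELAXED);
#endif
}
```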

> __protected_stream_set()
> __protected_stream_count()
> __protected_stream_count_depth() // currently not implemented in gcc
> __protected_stream_go()

Those are pretty specific to CBE I think?

> The implementation of stwcx() and stdcx() needs revision on PPC.
> As I understand it, there is no need for the mfocrf instruction or the
> mask-and-shift on the result.

How else would you output the CR0.EQ bit?

mfocrf does not exist on all ISA versions we support though.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-10 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-01-11
     Ever confirmed|0                           |1

--- Comment #5 from Andrew Pinski  ---
> __lwarx()
> __ldarx()
> __stwcx()
> __stdcx()

Is there a reason why the __atomic_* builtins don't work?
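
For context, a read-modify-write that has no dedicated builtin is usually written as a compare-exchange loop with the __atomic_* builtins. The sketch below implements an atomic "fetch max"; the function name is hypothetical, and it makes no claim about covering the lockword manipulations discussed elsewhere in this thread.

```c
#include <stdint.h>
#include <stdbool.h>

/* Atomically set *p to max(*p, v) and return the previous value. */
static inline uint32_t atomic_fetch_max_u32(uint32_t *p, uint32_t v)
{
    uint32_t old = __atomic_load_n(p, __ATOMIC_RELAXED);
    /* A weak compare-exchange may fail spuriously; on any failure the
       builtin refreshes `old` with the current value, so just retry. */
    while (old < v &&
           !__atomic_compare_exchange_n(p, &old, v, true,
                                        __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
        ;
    return old;
}
```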

> __protected_stream_set()
> __protected_stream_count()
> __protected_stream_count_depth() // currently not implemented in gcc
> __protected_stream_go()

These are the most useful ones really.

>the cntlz ones are not, for example

:)  It has been a long time since I touched this but I would not doubt I messed
up this too.

__mulh* intrinsics are better implemented these days using either 64bit or
128bit multiplies.

__l[hwd]brx/__st[hwd]brx intrinsics are better implemented as __builtin_bswap*
combined with a load/store these days (the bswap builtins did not exist back
then, or were not optimized)

Many of the other intrinsics should be implemented without inline asm too; even
fma should be done using __builtin_fma :).

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-10 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #4 from Matt Emmerton  ---
The intrinsics that we would find useful, having used them as provided by the
IBM XL C/C++ compiler, are the following:

__sync()
__isync()
__lwsync()

__dcbt()
__dcbtst()

__lwarx()
__ldarx()
__stwcx()
__stdcx()

__protected_stream_set()
__protected_stream_count()
__protected_stream_count_depth() // currently not implemented in gcc
__protected_stream_go()

The implementation of stwcx() and stdcx() needs revision on PPC.
As I understand it, there is no need for the mfocrf instruction or the
mask-and-shift on the result.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-10 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #3 from Segher Boessenkool  ---
Okay, I'll bite.

Which of the functions/macros in ppu_intrinsics.h would you find useful?
Have you checked whether the implementation is good for your purpose, or
if they even are correct (the cntlz ones are not, for example)?

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-08 Thread memmerto at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #2 from Matt Emmerton  ---
This appears to have packaging complications by vendors as well :(

On powerpc-ibm-aix7.1.0.0 this doesn't get installed.
On ppc64le-redhat-linux it does.

However, both of these cases would benefit from something targeted specifically
to PPC, rather than PPU/Cell.

[Bug target/93177] PPC: Missing many useful platform intrinsics

2020-01-07 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93177

--- Comment #1 from Andrew Pinski  ---
ppu_intrinsics.h is installed for all powerpc* configs.  Though to use it you
need to compile with -mcpu=cell :)