[Bug tree-optimization/77647] New: Missed opportunity to use register value causes additional load

2016-09-19 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77647

Bug ID: 77647
   Summary: Missed opportunity to use register value causes
additional load
   Product: gcc
   Version: 6.2.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

static inline long load(long *p)
{
        long ret;

        asm ("movq  %1,%0\n\t" : "=r" (ret) : "m" (*p));
        if (ret != *p)
                __builtin_unreachable();
        return ret;
}

long foo(long *mem)
{
        long ret;

        ret = load(mem);
        return ret - *mem;
}

foo() compiles down to 'xorl %eax,%eax ; ret', which is great. Changing the
minus to plus gives:

movq  (%rdi),%rax
addq  (%rdi),%rax
ret

[Bug middle-end/71509] Bitfield causes load hit store with larger store than load

2017-05-11 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509

Nicholas Piggin  changed:

   What|Removed |Added

 CC||npiggin at gmail dot com

--- Comment #5 from Nicholas Piggin  ---
This test case seems like it may be related. It does the right thing and uses
all 4-byte ops when the 8-byte alignment is removed. I post it here because it
may not always be the case that the smallest op is best.

struct s {
        unsigned long align1;
        union {
                unsigned int blah;
                unsigned int a:1;
        };
};

void test2(struct s *s)
{
        s->blah = 100;
        if (s->a)
                asm volatile("#blah");
}

Generates (gcc 7.0.1)

test2:
li 9,100
stw 9,8(3)
ld 9,8(3)
andi. 9,9,0x1
beqlr 0
#APP
 # 29 "a.c" 1
#blah
 # 0 "" 2
#NO_APP
blr

There is a more general case of mismatched load and store sizes in unions with
different-size types, but in this case I think the sizes could be matched.

[Bug c/84514] New: powerpc sub optimal condition register reuse with extended inline asm

2018-02-22 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84514

Bug ID: 84514
   Summary: powerpc sub optimal condition register reuse with
extended inline asm
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

Created attachment 43488
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43488&action=edit
test case with description in comment at the top

There seem to be some missed opportunities for reusing the condition register
across extended asm (that does not clobber cc).

[Bug tree-optimization/84443] New: powerpc suboptimal code generation (shrink wrap unlikely path) for Linux spinlocks

2018-02-18 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84443

Bug ID: 84443
   Summary: powerpc suboptimal code generation (shrink wrap
unlikely path) for Linux spinlocks
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

Created attachment 43452
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43452&action=edit
testcase with comment at the top describing desired output

A small fast-path function gets several non-volatile registers saved on the
stack, and a sub-optimal restore in the return path.

The example is derived from (but not identical to) the Linux kernel spinlock
implementation.

Tested with gcc version 8.0.1 20180207 (experimental) [trunk revision 257435]
(Debian 8-20180207-2).

[Bug c/84483] New: powerpc sub-optimal code generation - conditional relative branch to and over blr

2018-02-20 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84483

Bug ID: 84483
   Summary: powerpc sub-optimal code generation - conditional
relative branch to and over blr
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

Created attachment 43472
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43472&action=edit
test case with description in comment at the top

I have attached a test case, derived loosely from Linux spinlock code, that
shows two possible types of sub-optimal code generation where relative branches
are used to execute or skip a blr. A conditional blr could be used instead.

[Bug target/84626] New: powerpc toc register is reloaded unnecessarily

2018-02-28 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84626

Bug ID: 84626
   Summary: powerpc toc register is reloaded unnecessarily
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

gcc version 8.0.1 20180207 (experimental) [trunk revision 257435] (Debian
8-20180207-2):

Test case:

void test(void (*fn)(void), unsigned long i)
{
        while (i--)
                fn();
}

Generates code:

  2c:   18 00 41 f8 std r2,24(r1)
  30:   00 00 00 60 nop
  34:   00 00 00 60 nop
  38:   00 00 00 60 nop
  3c:   00 00 42 60 ori r2,r2,0
  40:   78 f3 cc 7f mr  r12,r30
  44:   a6 03 c9 7f mtctr   r30
  48:   ff ff ff 3b addi    r31,r31,-1
  4c:   21 04 80 4e bctrl
  50:   18 00 41 e8 ld  r2,24(r1)
  54:   ff ff bf 2f cmpdi   cr7,r31,-1
  58:   e8 ff 9e 40 bne cr7,40 <test+0x40>

The r2 load could be moved out of the loop.

[Bug target/84627] powerpc excess padding generated for POWER9 target

2018-02-28 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84627

--- Comment #2 from Nicholas Piggin  ---
Ah sorry, target is powerpc64le

[Bug target/84627] New: powerpc excess padding generated for POWER9 target

2018-02-28 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84627

Bug ID: 84627
   Summary: powerpc excess padding generated for POWER9 target
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

gcc version 8.0.1 20180207 (experimental) [trunk revision 257435] (Debian
8-20180207-2):

Test case:

void test(void (*fn)(void), unsigned long i)
{
        while (i--)
                fn();
}

Generates code with -O2 -mcpu=power9:

  2c:   18 00 41 f8 std r2,24(r1)
  30:   00 00 00 60 nop
  34:   00 00 00 60 nop
  38:   00 00 00 60 nop
  3c:   00 00 42 60 ori r2,r2,0
  40:   78 f3 cc 7f mr  r12,r30
  44:   a6 03 c9 7f mtctr   r30
  48:   ff ff ff 3b addi    r31,r31,-1
  4c:   21 04 80 4e bctrl
  50:   18 00 41 e8 ld  r2,24(r1)
  54:   ff ff bf 2f cmpdi   cr7,r31,-1
  58:   e8 ff 9e 40 bne cr7,40 <test+0x40>

On POWER9, I wonder whether the nops and ori should be avoided? Aligning to a
quad-word boundary is reasonable for fetch, but is there any advantage for
dispatch in adding the extra padding?

[Bug target/84627] powerpc excess padding generated for POWER9 target

2018-03-05 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84627

Nicholas Piggin  changed:

   What|Removed |Added

 Resolution|INVALID |FIXED

--- Comment #4 from Nicholas Piggin  ---
After some more discussion and testing, it was determined that for POWER9, 16
bytes of alignment is sufficient for these small loops, and it was adjusted.

https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=258260

Author: segher
Date:   Mon Mar 5 19:11:54 2018 UTC
Log Message:

rs6000: Don't align tiny loops to 32 bytes for POWER9

For POWER4..POWER8 we align loops of 5..8 instructions to 32 bytes
(instead of to 16 bytes) because that executes faster.  This is no
longer the case on POWER9, so we can just as well only align to 16
bytes.


* config/rs6000/rs6000.c (rs6000_loop_align): Don't align tiny loops
to 32 bytes when compiling for POWER9.

[Bug middle-end/84443] powerpc suboptimal code generation (shrink wrap unlikely path) for Linux spinlocks

2018-02-27 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84443

--- Comment #3 from Nicholas Piggin  ---
(In reply to Segher Boessenkool from comment #2)
> If you want some specific machine code for something complex like this, it
> is much easier to write the machine code than to fight the compiler.

Yes, understood. This may be the way we go eventually.

I thought the test case could be interesting for reference. Oftentimes we do
juggle the C/inline asm code around to get a good pattern, which can be a bit
easier than writing it completely in asm. It does not have to be absolutely
optimal, of course.

Thanks for taking a look and providing some explanation of the problem. Any
improvement would be great.

> 
> That said:
> 
> 1) "flags" is stored in the same register everywhere.  This is a problem in
> expand: it puts the return value in the same pseudo in the whole function.
> This is a bad idea because (if it did not) after optimisation the return
> value is just a hard register (r3) and putting all _other_ uses in the same
> register is a pessimisation; like here (and in many other cases, there are
> other PRs for this!) it causes that pseudo to be put in a callee-save
> register (r30).
> 
> 2) That should be fixed in expand (to enable other optimisations with the
> return pseudo), but IRA should also be smarter about live-range splitting
> for shrink-wrapping.
> 
> 3) Separate shrink-wrapping should deal with CRFs.  I have a prototype,
> it has a few small problems (we need to handle things a little differently
> for pretty much every ABI, sigh), and it does not help very much until the
> other problems are solved; GCC 9 work);
> 
> 4) Shrink-wrapping itself could also do more live-range splitting than it
> currently does.

[Bug target/88765] New: powerpc64le-linux-gnu sub-optimal code generation for builtin atomic ops

2019-01-08 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88765

Bug ID: 88765
   Summary: powerpc64le-linux-gnu sub-optimal code generation for
builtin atomic ops
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---
Target: powerpc64le-linux-gnu

gcc version 8.2.0 (Debian 8.2.0-4) 

Linux uses a lot of non-trivial atomic operations, and implementing them with
compare_exchange results in sub-optimal code generation. A common one is
add_unless_zero, which is often used with RCU to take a reference count, or
fail if the last reference has already gone (a very rare case).

---
#include <stdbool.h>

bool add_unless_zero(unsigned long *mem, unsigned long inc)
{
        unsigned long val;

        do {
                val = __atomic_load_n(mem, __ATOMIC_RELAXED);
                if (__builtin_expect(val == 0, false))
                        return false;
        } while (__builtin_expect(!__atomic_compare_exchange_n(mem,
                        &val, val + inc, true,
                        __ATOMIC_RELAXED, __ATOMIC_RELAXED), false));

        return true;
}
---
This compiles to:

add_unless_zero:
.L4:
ld 10,0(3)
cmpdi 7,10,0
add 8,10,4
beq 7,.L5
ldarx 9,0,3
cmpd 0,9,10
bne 0,.L3
stdcx. 8,0,3
.L3:
bne 0,.L4
li 3,1
blr
.L5:
li 3,0
blr

Better would be

add_unless_zero:
.L4:
ldarx 9,0,3
cmpdi 0,9,0
add 8,9,4
beq 0,.L5
stdcx. 8,0,3
bne 0,.L4
li 3,1
blr
.L5:
li 3,0
blr

Using extended inline asm to implement these is an adequate workaround.
Unfortunately that does not work for 128-bit operations on powerpc, because
there is no register-pair constraint, and the builtins generate much worse
code. Changing the types to __int128_t gives a bad result:

add_unless_zero:
lq 10,0(3)
mr 6,3
or. 9,10,11
addc 3,11,4
mr 7,11
adde 11,10,5
beq 0,.L16
std 28,-32(1)
std 29,-24(1)
std 30,-16(1)
std 31,-8(1)
.L12:
lqarx 28,0,6
xor 9,29,7
xor 10,28,10
or. 9,9,10
bne 0,.L4
mr 30,11
mr 31,3
stqcx. 30,0,6
.L4:
mfcr 3,128
rlwinm 3,3,3,1
bne 0,.L17
.L2:
ld 28,-32(1)
ld 29,-24(1)
ld 30,-16(1)
ld 31,-8(1)
blr
.L17:
lq 10,0(6)
or. 9,10,11
addc 3,11,4
mr 7,11
adde 11,10,5
bne 0,.L12
li 3,0
b .L2
.L16:
li 3,0
blr

Closer to ideal would be

add_unless_zero:
.Lagain:
        lqarx   6,0,3
        or.     8,6,7
        addc    6,6,4
        adde    7,7,5
        beq     0,.Lfail
        stqcx.  6,0,3
        bne     0,.Lagain
        li      3,1
        blr
.Lfail:
        li      3,0
        blr

[Bug target/94393] Powerpc suboptimal 64-bit constant comparison

2020-04-02 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94393

--- Comment #4 from Nicholas Piggin  ---
(In reply to Alan Modra from comment #3)
> There are two parts to fixing this PR.
> 1) We can do better in the sequences generated for some constants.
> 2) rs6000_rtx_costs needs to be accurate, so that expensive constants are
> put in memory and stay there with optimisation.
> 
> Having looked at part 2 a little, I'd say fixes for that are definitely not
> stage 4 material.

Well, the other part I thought of is that a different branch test could be
chosen in order to use the cheapest constant. Even if the 0xbfff... constant
were generated more cheaply, it would still be more expensive than 0xc000....

That was my intention in opening the two bugs, but I didn't explain myself very
well.

[Bug target/94393] New: Powerpc suboptimal 64-bit constant comparison

2020-03-29 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94393

Bug ID: 94393
   Summary: Powerpc suboptimal 64-bit constant comparison
   Product: gcc
   Version: 9.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

--- test case ---
void test1(unsigned long a)
{
        if (a > 0xc000000000000000ULL)
                printf("yes\n");
}
void test2(unsigned long a)
{
        if (a >= 0xc000000000000000ULL)
                printf("yes\n");
}
---

The first (important part) compiles to

li 9,-1
rldicr 9,9,0,1
cmpld 0,3,9
blelr 0

The second to

lis 9,0xbfff
ori 9,9,0xffff
sldi 9,9,32
oris 9,9,0xffff
ori 9,9,0xffff
cmpld 0,3,9
blelr 0

The second could use the same 2-insn constant as the first, but with bltlr.

[Bug target/94395] New: Powerpc suboptimal 64-bit constant generation near large values with few bits set

2020-03-29 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94395

Bug ID: 94395
   Summary: Powerpc suboptimal 64-bit constant generation near
large values with few bits set
   Product: gcc
   Version: 9.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

0xc000000000000000UL is generated with

li 9,-1
rldicr 9,9,0,1

0xbfffffffffffffffUL (0xc000000000000000UL - 1) is

lis 9,0xbfff
ori 9,9,0xffff
sldi 9,9,32
oris 9,9,0xffff
ori 9,9,0xffff

Could be

li 9,-1
rldicr 9,9,0,1
subi 9,9,1

[Bug rtl-optimization/96475] New: direct threaded interpreter with computed gotos generates suboptimal dispatch loop

2020-08-04 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96475

Bug ID: 96475
   Summary: direct threaded interpreter with computed gotos
generates suboptimal dispatch loop
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
CC: segher at gcc dot gnu.org
  Target Milestone: ---
Target: powerpc64le-linux-gnu

Created attachment 48999
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48999&action=edit
test case

The attached test case, compiled with -O2, generates for run_program_goto a
central indirect-branch dispatcher, with each handler branching back to the
central dispatch.

Direct-threaded code with indirect branches between handlers is faster on
POWER9 when there are no branch mispredictions, due to fewer branches, and it
should generally do better with branch prediction too, since there is an
indirect branch from each handler.

[Bug target/96017] New: Powerpc suboptimal register spill in likely path

2020-07-01 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96017

Bug ID: 96017
   Summary: Powerpc suboptimal register spill in likely path
   Product: gcc
   Version: 9.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---
Target: powerpc64le-linux-gnu
 Build: gcc version 9.2.1 20190909 (Debian 9.2.1-8)

-- test.c --
extern int foo;
extern void slowpath(int *);

int test(int *val)
{
        int ret = foo;

        if (__builtin_expect(*val != 0, 0))
                slowpath(val);

        return ret;
}
--

Compiling with -O2 gives the following asm. It seems possible for the fast-path
case to avoid the stack frame by using a volatile register to save the val
argument in case the slow path needs it (or alternatively to save the load from
'foo', as r31 is used now, but that requires an extra register move on the
critical path for the return value). This should be smaller and faster code
even for the slow path.

        addis   r2,r12,0
        addi    r2,r2,0
        lwz     r9,0(r3)
        addis   r10,r2,0
        ld      r10,0(r10)
        std     r31,-8(r1)
        stdu    r1,-48(r1)
        lwa     r31,0(r10)
        cmpwi   r9,0
        bne     1f
        addi    r1,r1,48
        mr      r3,r31
        ld      r31,-8(r1)
        blr
        nop
        ori     r2,r2,0
1:      mflr    r0
        std     r0,64(r1)
        bl      slowpath
        nop
        ld      r0,64(r1)
        addi    r1,r1,48
        mr      r3,r31
        ld      r31,-8(r1)
        mtlr    r0
        blr

[Bug target/96017] Powerpc suboptimal register spill in likely path

2020-07-01 Thread npiggin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96017

--- Comment #3 from Nicholas Piggin  ---
This is possibly a more targeted and simpler test case for at least one of the
problems in bug 84443. I would like to re-test that case after this one is
solved if it's not an obvious duplicate.

[Bug rtl-optimization/102062] powerpc suboptimal unrolling simple array sum

2021-08-25 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

--- Comment #14 from Nicholas Piggin  ---
(In reply to Bill Schmidt from comment #10)
> Well, the problem is that we still generate suboptimal code on GCC 11.  I
> don't know whether we want to address that or not.
> 
> I suppose we aren't going to backport Haochen's lovely patch for sign
> extensions, and it's probably not worth messing around with variable
> expansion just for GCC 11...
> 
> Nick, can you use the -ftree-vectorize -fvect-cost-model=cheap to deal with
> this (or even use -O3, which is recommended and probably does the same
> thing)?  Or are you holding out for some fix in GCC 11.x?

For my case, no, but it's only a very small issue in the scheme of things; I
only found it by looking at code generation. So you can close it immediately if
it's fixed upstream, or whatever else you think best. Thanks all.

[Bug c/102062] New: powerpc suboptimal unrolling simple array sum

2021-08-25 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

Bug ID: 102062
   Summary: powerpc suboptimal unrolling simple array sum
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---
Target: powerpc64le-linux-gnu

--- test.c ---
int test(int *arr, int sz)
{
        int ret = 0;
        int i;

        if (sz < 1)
                __builtin_unreachable();

        for (i = 0; i < sz*2; i++)
                ret += arr[i];

        return ret;
}
---

gcc-11 compiles this to:
test:
rldic 4,4,1,32
addi 10,3,-4
rldicl 9,4,63,33
li 3,0
mtctr 9
.L2:
addi 8,10,4
lwz 9,4(10)
addi 10,10,8
lwz 8,4(8)
add 9,9,3
add 9,9,8
extsw 3,9
bdnz .L2
blr

I may be unaware of a constraint of C standard here, but maintaining the two
base addresses seems pointless, so is beginning the first at offset -4.

The bigger problem is keeping a single sum. Keeping two sums and adding them at
the end reduces the critical latency of the loop from 6 to 2, which brings
throughput on large loops from 6 cycles per iteration down to about 2.2 on
POWER9, without harming short loops:

test:
rldic 4,4,1,32
rldicl 9,4,63,33
mtctr 9
li 8,0
li 9,0
.L2:
lwz 6,0(3)
lwz 7,4(3)
addi 3,3,8
add  8,8,6
add  9,9,7
bdnz .L2
add 9,9,8
extsw 3,9
blr

Any reason this can't be done?

[Bug c/102062] powerpc suboptimal unrolling simple array sum

2021-08-25 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

--- Comment #3 from Nicholas Piggin  ---
(In reply to Bill Schmidt from comment #2)
> As expected, I get similar code when compiling either for P9 or P10.

Oh I should have specified, -O2 is the only option. If I add
-fvariable-expansion-in-unroller it has no effect, just to make sure.

It's gcc from Debian (gcc version 11.2.0 (Debian 11.2.0-3)). Maybe they've done
something to change this.

[Bug c/102062] powerpc suboptimal unrolling simple array sum

2021-08-25 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

--- Comment #5 from Nicholas Piggin  ---
(In reply to Bill Schmidt from comment #2)
> As expected, I get similar code when compiling either for P9 or P10.

Oh I should have specified, -O2 is the only option. If I add
-fvariable-expansion-in-unroller it has no effect, just to make sure.

It's gcc from Debian (gcc version 11.2.0 (Debian 11.2.0-3)). Maybe they've done
something to change this.

(In reply to Bill Schmidt from comment #1)
> Regarding the latter question, I'm surprised it's not being done.  This
> behavior is controlled by -fvariable-expansion-in-unroller, which was
> enabled by default for PowerPC targets a couple of releases back.  You
> reported this against GCC 11.2, but I'm skeptical.  What options are you
> using?
> 
> Compiling with -O2 and current trunk, I see variable expansion kicking in,
> and I also see the same base register in use in all references in the loop:
> 
> test:
> .LFB0:
> .cfi_startproc
> .localentry test,1
> slwi 4,4,1
> li 10,0
> li 7,0
> addi 9,3,-4
> extsw 4,4
> andi. 6,4,0x3
> addi 5,4,-1
> mr 8,4
> beq 0,.L9
> cmpdi 0,6,1
> beq 0,.L13
> cmpdi 0,6,2
> bne 0,.L22
> .L14:
> lwzu 6,4(9)
> addi 4,4,-1
> add 10,10,6
> .L13:
> lwzu 6,4(9)
> cmpdi 0,4,1
> add 10,10,6
> beq 0,.L19
> .L9:
> srdi 8,8,2
> mtctr 8
> .L2:
> lwz 4,4(9)
> lwz 5,12(9)
> lwz 6,8(9)
> lwzu 8,16(9)
> add 10,4,10
> add 10,10,5
> add 7,6,7
> add 7,7,8
> bdnz .L2
> .L19:
> add 3,10,7
> extsw 3,3
> blr
> .p2align 4,,15
> .L22:
> lwz 10,0(3)
> mr 9,3
> mr 4,5
> b .L14

That asm does well on the test, better than my version (a little on P9, a lot
on P10). It does have 2x more unrolling, which probably helps a bit.

[Bug c/102169] New: powerpc64 int memory operations using FP instructions

2021-09-01 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102169

Bug ID: 102169
   Summary: powerpc64 int memory operations using FP instructions
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
CC: bergner at gcc dot gnu.org
  Target Milestone: ---
Target: powerpc64le-linux-gnu

--- test.c ---
int foo, bar;

void test(void)
{
        foo = bar;
}
---

Using Debian gcc 10.2 with -O2 flag, this compiles to:

 addis   r2,r12,0
 addi    r2,r2,0
 addis   r9,r2,0
 addi    r9,r9,0
 lfiwzx  f0,0,r9
 addis   r9,r2,0
 addi    r9,r9,0
 stfiwx  f0,0,r9
 blr

Peter confirmed it also uses FP registers with trunk (but I don't have the asm
output at hand).

This can be suboptimal on some processors; e.g., on POWER9 lfiwzx is "Tuple
Restricted (R)", which reduces dispatch throughput in the cycle it is
dispatched. And it generally seems like a surprising thing to do when there is
no shortage of GPRs.

[Bug c/102239] New: powerpc suboptimal boolean test of contiguous bits

2021-09-08 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

Bug ID: 102239
   Summary: powerpc suboptimal boolean test of contiguous bits
   Product: gcc
   Version: 11.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---
Target: powerpc64le-linux-gnu

gcc version 11.2.1 20210815 (Debian 11.2.0-2) 

Build flags -O2

--- test.c ---
void foo(long arg)
{
        if (arg & ((1UL << 33) | (1UL << 34)))
                asm volatile("# if");
        else
                asm volatile("# else");
}
---

generates:

foo:
rldicr 3,3,29,1
srdi. 3,3,29
beq 0,.L6
# if
blr
.L6:
# else
blr

This test of multiple contiguous bits could be performed with a single
instruction, rldicr. or rldicl.

Those record-form instructions do tend to be more expensive than the Rc=0 form
(C2 cracked on POWER9, slower pipe class on POWER10), but I think they should
still be better than the current 2-instruction sequence.

[Bug rtl-optimization/102062] powerpc suboptimal unrolling simple array sum

2021-09-22 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

Nicholas Piggin  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #16 from Nicholas Piggin  ---
gcc-11 -O2 -mcpu=power10 now generates identical code that Bill listed for
trunk:

.L2:
lwz 4,4(9)
lwz 5,12(9)
lwz 6,8(9)
lwzu 8,16(9)
add 10,4,10
add 10,10,5
add 7,6,7
add 7,7,8
bdnz .L2

Thanks all.

[Bug target/102485] -Wa,-many no longer has any effect

2022-02-23 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485

--- Comment #9 from Nicholas Piggin  ---
And upstream gas still doesn't even warn with -many!!

[Bug c/104671] New: -Wa,-m no longer has any effect

2022-02-23 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104671

Bug ID: 104671
   Summary: -Wa,-m no longer has any effect
   Product: gcc
   Version: 10.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
CC: amodra at gcc dot gnu.org, bergner at gcc dot gnu.org,
dje at gcc dot gnu.org, meissner at gcc dot gnu.org, pc at 
us dot ibm.com,
segher at gcc dot gnu.org, wschmidt at gcc dot gnu.org
  Target Milestone: ---

Commit e154242724b084380e3221df7c08fcdbd8460674 ("Don't pass -many to the
assembler") also added a workaround that emits a ".machine" directive at the
top of each generated .s file, corresponding to the -mcpu= option, to work
around buggy build code which uses incorrect -Wa,-mcpu options:

  This patch also emits .machine assembler directives for ELF targets
  when functions are compiled for different cpus via attributes or
  pragmas.  That's necessary when the initial -m option passed to
  the assembler doesn't enable the superset of all opcodes emitted, as
  seen by a failure of gcc.target/powerpc/clone2.c without .machine
  when building gcc for power8.

This also prevents passing a -m to the assembler which is a strict
superset of the -mcpu= generated code.

This is a valid situation where .c code is built with a baseline architecture
compatibility plus dynamic support for newer instructions with inline asm. The
Linux kernel extensively uses this pattern.

[Bug c/104671] -Wa,-m no longer has any effect

2022-02-23 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104671

--- Comment #1 from Nicholas Piggin  ---
The comment in recent binutils.git commit cebc89b9328 sheds some more light on
this and possibly provides a workaround in binutils for the errant .machine
directive.

The referenced gcc bug #101393 looks like it proposed removing .machine, but
I'm not entirely sure whether it was also about other things, and it hasn't
been resolved.

[Bug target/102485] -Wa,-many no longer has any effect

2022-02-23 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485

--- Comment #8 from Nicholas Piggin  ---
(In reply to Segher Boessenkool from comment #7)
> > GCC already passes -m to the assembler though.
> 
> That mostly is historic.

So? I was pointing out the compiler already tells the assembler what
instruction set to use without the .machine directive.

> 
> > The justification for emitting the .machine directive is given as fixing a
> > build breakage due to a build system that passes an incorrect -m to the
> > assembler.
> 
> Not really, no. 

That is really the justification for emitting the .machine directive as
provided in the changelog of the commit which introduced the change.

> That is just one tiny part of the problem.  It is impossible
> to know what instruction sets we need ahead of time, and -many cannot work
> (and
> *does not* work: there are quite a few mnemonics that encode to different
> insns
> on different architecture versions (or for different CPUs), and we cannot
> know
> which is wanted, or which is preferred, ahead of time.

I understand the problems with -many, but it can and does work for real
software. E.g., Linux kernel as of before this change. It's not -many I'm
wedded to though, it's any ability to fix this sanely because of the .machine
directive.

The kernel would change to using a specific CPU, e.g., -mcpu=power4
-Wa,-mpower10, and deal with the very rare clashing mnemonics (e.g., tlbie)
specially with the .machine directive when an older encoding is required.

> 
> > *That* is the broken code (if any) that should have been fixed. But instead
> > that is hacked around in a way that breaks working code that passes down
> > -Wa,-many option as specified.
> 
> There is no working code that uses -many (except by accident, if no problem
> has hit you yet).

I'll reword. "Instead that is hacked around in a way that breaks working code
that passes down the -Wa,-m option as specified."

> 
> > The kernel builds with a base compatibility (say -mcpu=power4) and then has
> > certain code paths that are dynamically taken if running on newer CPUs which
> > use newer instructions with inline asm.
> > 
> > This is an important usage and it's pervasive, it seems unreasonable to
> > break it.  Adding .machine directives throughout inline asm for every
> > instruction not in the base compatibility class is pretty horrible.
> 
> It is the only correct thing to do.

It's not. Not passing .machine and passing -mcpu is just as correct. With the
added bonus that it allows software to use a superset of instructions in such
cases. And even the great bonus that existing "broken" code that uses -many
will continue to work.

The correct way to deal with this is not to break this, it is to add a warning
to -many for some period to binutils to give code a chance to migrate. I'm all
for removing -many, and that is the right way to do it. By deprecating -many
and providing warnings. Not by hacking around it in the compiler that breaks
things.

[Bug target/104671] -Wa,-m no longer has any effect

2022-02-24 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104671

--- Comment #4 from Nicholas Piggin  ---
(In reply to Alan Modra from comment #3)
> (In reply to Nicholas Piggin from comment #0)
> > Commit e154242724b084380e3221df7c08fcdbd8460674 ("Don't pass -many to the
> > assembler") also added a workaround emitting a ".machine" directive to the
> > top of each generated .s file
> 
> Nope, wrong commit.  The added .machine there is one in response to an
> attribute or pragma selecting a different cpu.  ie. it's in response to user
> source code saying "compile this function for this particular cpu, not
> necessarily the one given on the gcc command line".
> 
> Commit b9f12a01bedb5 is the first commit that added .machine to the top of
> assembly files.  The early .machine was not always emitted by that
> particular commit.  Always emitting the early .machine came later.

Thanks for the correction.

And thank you for taking the time to write the detailed explanation in bug
#102485, it's very helpful. Not sure where it leaves this report wrt gcc, but
upstream binutils solves the problem for me so I would be happy to consider it
resolved.

It is what it is; we can patch Linux to fix the build issues, and the proposed
fixes don't look too onerous. And we need to remove -many from our build anyway
(we should have done so a long time ago), so this will give the necessary prod.

[Bug target/102485] -Wa,-many no longer has any effect

2022-02-23 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485

Nicholas Piggin  changed:

   What|Removed |Added

 CC||npiggin at gmail dot com

--- Comment #6 from Nicholas Piggin  ---
(In reply to Peter Bergner from comment #1)
> I agree it is GCC's job to emit a ".machine CPU" directive that allows the
> assembler to correctly assemble the code GCC generates.

GCC already passes -m to the assembler though.

The justification for emitting the .machine directive is given as fixing a
build breakage due to a build system that passes an incorrect -m to the
assembler.

*That* is the broken code (if any) that should have been fixed. But instead
that is hacked around in a way that breaks working code that passes down
-Wa,-many option as specified.

>  Your test case
> however uses inline asm and GCC does not know that you are using the mfppr32
> mnemonic.  The assembler code you write in an inline asm is basically a
> black box to the compiler.  It is therefor up to the programmer to ensure
> that either the -mcpu=CPU GCC option that is being used (which
> emits".machine CPU" directive) is enough to assemble the mnemonics in the
> inline asm or you have to emit them in your inline asm.

The kernel builds with a base compatibility (say -mcpu=power4) and then has
certain code paths that are dynamically taken if running on newer CPUs which
use newer instructions with inline asm.

This is an important usage and it's pervasive, it seems unreasonable to break
it.  Adding .machine directives throughout inline asm for every instruction not
in the base compatibility class is pretty horrible.

[Bug c/106895] New: powerpc64 strange extended inline asm behaviour with register pairs

2022-09-09 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895

Bug ID: 106895
   Summary: powerpc64 strange extended inline asm behaviour with
register pairs
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
CC: segher at gcc dot gnu.org
  Target Milestone: ---
Target: powerpc64le-linux-gnu

This is Debian 12.2.0-1 build.

powerpc64 instructions operating on 128 bits use "even/odd pair" of adjacent
GPRs, where the lower numbered register specifies the pair and it must be an
even number. This code compiled with -O2:

void set128(__int128 val, __int128 *mem)
{
#if WORKAROUND
asm("");
#endif
asm("stq %1,%0" : "=m"(*mem) : "r"(val));
}

Results in an invalid even/odd pair:

stq 3,0(5)
blr

If compiled with -DWORKAROUND it results in a valid pair:

mr 10,3
mr 11,4
stq 10,0(5)
blr

[Bug target/108239] New: -mprefixed causes too large displacements for extended inline asm memory operands

2022-12-27 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108239

Bug ID: 108239
   Summary: -mprefixed causes too large displacements for extended
inline asm memory operands
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
CC: segher at gcc dot gnu.org
  Target Milestone: ---
Target: powerpc64le-linux-gnu

--- test.c ---
// powerpc64le-linux-gnu-gcc -O2 -mcpu=power10 -mno-pcrel -c test.c

#include <stdint.h>

static inline uint32_t readl(uint32_t *addr)
{
uint32_t ret;
__asm__ __volatile__("lwz %0,%1" : "=r" (ret) : "m" (*addr));
return ret;
}

int test(void *addr)
{
return readl(addr + 0x8024);
}
---

This generates invalid assembly 'lwz  3,32804(3)' with displacement exceeding
allowable size. Using -mno-prefixed fixes the problem.

[Bug c/108018] New: Wide immediate sequences not scheduled for POWER10 fusion

2022-12-07 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108018

Bug ID: 108018
   Summary: Wide immediate sequences not scheduled for POWER10
fusion
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: npiggin at gmail dot com
  Target Milestone: ---

POWER10 has "wideimmediate" fusion sequences that include addi rx,ra,si ; addis
rx,rx,SI and addis rx,ra,si ; addi rx,rx,SI

--- test.c ---
static int foo, bar;

unsigned long test(void)
{
return (unsigned long)&foo + (unsigned long)&bar;
}
---

This test case when compiled with -O2 -mcpu=power10 -mno-pcrel results in

addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
addis 3,2,.LANCHOR0@toc@ha
addis 9,2,.LANCHOR0+4@toc@ha
addi 3,3,.LANCHOR0@toc@l
addi 9,9,.LANCHOR0+4@toc@l
add 3,3,9
blr

The TOC pointer generation is scheduled properly because it is the global entry
prologue, but the variable address generation is scheduled to favour
dependencies rather than the static fusion sequences that P10 has, which should
be preferable.

[Bug target/106895] powerpc64 strange extended inline asm behaviour with register pairs

2023-06-15 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895

Nicholas Piggin  changed:

   What|Removed |Added

 Resolution|INVALID |---
 Status|RESOLVED|UNCONFIRMED

--- Comment #4 from Nicholas Piggin  ---
I would like to reopen this. The example provided was not the actual desired
operation; it was just an illustration.

Such register-pair instructions in Power are not usable in extended inline asm
without this (except maybe by explicit fixed allocation with clobbers or
register variables which is undesirable).

[Bug target/106895] powerpc64 unable to specify even/odd register pairs in extended inline asm

2023-07-06 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895

--- Comment #11 from Nicholas Piggin  ---
(In reply to Segher Boessenkool from comment #10)
> (In reply to Nicholas Piggin from comment #9)
> > I don't know why constraint is wrong and mode is right
> 
> Simple: you would need O(2**T*N) constraints for our existing N register
> constraints, together with T features like this.  But only O(2**T) modes at
> most.

I guess that would be annoying if you couldn't have modifiers on constraints,
or had a bad algorithm for working them out. Fair enough.

> 
> > or why TI doesn't work but PTI apparently would,
> 
> Because this is exactly what PTImode is *for*!

Right, I accept that it is; I meant I just would not have been able to work it
out myself (assuming that if PTI were documented, it would be "Partial Tetra
Integer" and no more useful than the documentation for the other P?I types).

> 
> > but I'll take anything that works. Could we
> > get PTI implemented? Does it need a new issue opened?
> 
> It was implemented in 2013.  The restriction to only even pairs was a bugfix,
> also from 2013.
> 
> If you have code like
> 
>   typedef __int128 __attribute__((mode(PTI))) even;
> 
> you get an error like
> 
>   error: no data type for mode 'PTI'
> 
> This needs fixing.  You can keep it in this PR?

Sure, that would be much appreciated.

[Bug target/106895] powerpc64 unable to specify even/odd register pairs in extended inline asm

2023-07-08 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895

--- Comment #13 from Nicholas Piggin  ---
(In reply to Segher Boessenkool from comment #12)
> > I guess that would be annoying if you couldn't have modifiers on constraints
> 
> There is no such thing as "operand modifiers".  There are *output* modifiers:
> they change how an operand is *printed*, they do not change the operand in
> any
> way, shape, or form.

No I mean like x, y, z for different register pairs, or Pn where n=0..4, eI,
eP, eQ for different immediates.

If they're practically just implemented as another constraint, what is really
the big deal with adding an even/odd GPR pair constraint anyway? You say it's
O(2**T*N) additional constraints, but as far as I can tell it turns out to be
one extra constraint, or maybe two if you need one that excludes r0.

> 
> > or a bad algorithm for working them out. Fair enough.
> 
> No idea what you mean here?

If you can't deal well with more constraints.

> 
> > > > or why TI doesn't work but PTI apparently would,
> > > 
> > > Because this is exactly what PTImode is *for*!
> > 
> > Right I accept it is, I meant I just would not have been able to work it out
> > (assuming if PTI was documented it would be "Partial Tetra Integer" and be
> > no more useful than the other P?I type documentation.
> 
> For the rs6000 port, multi-register operands are not restricted to aligned
> register numbers ("even/odd pairs").  (Some other ports do have this).  We
> use
> the existing PTI mode for that (it also can be allocated in GPRs only, never
> in
> VSRs, unlike TImode).
> 
> "Partial" does not have much meaning here.  A minority of ports use partial
> integer words for what they were introduced for originally: modes that are
> smaller than a full register, say, a 24-bit mode when registers are 32 bits.
> 
> We use it as another integer mode that is the same size.  It is unfortunate
> that we still have to resort to such tricks.

Okay that explains it, thank you.

[Bug target/106895] powerpc64 unable to specify even/odd register pairs in extended inline asm

2023-07-06 Thread npiggin at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895

--- Comment #9 from Nicholas Piggin  ---
(In reply to Segher Boessenkool from comment #8)
> (In reply to Peter Bergner from comment #6)
> > (In reply to Segher Boessenkool from comment #5)
> > > Constraints are completely the wrong tool for this.  Just use modes, which
> > > *are* the right tool?
> > 
> > Well you cannot specify modes in the asm, so I think you're saying we need
> > use the correct type that maps to a internal to GCC mode that has the
> > even/odd register behavior, so something like:
> > 
> >   unsigned int foo __attribute__ ((mode (XX)));
> > 
> > ...where XXmode is the new integer mode that gives us even/odd register
> > pairs?  Of course we have to be careful about how this all works wrt -m32
> > versus -m64.
> 
> No, the type there is "unsigned int".  I meant to say exactly what I did say:
> just use modes.  Which you indeed do in user code by the mode attribute, yes.
> 
> And you do not need a new mode: PTImode should just work.  But the user
> specifying that is currently broken it seems?

I don't know why constraint is wrong and mode is right or why TI doesn't work
but PTI apparently would, but I'll take anything that works. Could we get PTI
implemented? Does it need a new issue opened?

> 
> Without -mpowerpc64 you cannot *have* 128-bit integers in registers.  That
> should be fixed, but you cannot have it in just *two* registers, which is
> what is required here.  For most targets that then means -m64 is required.