[Bug target/114732] ge can't be reversed to unlt for bcd compares

2024-04-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114732

--- Comment #6 from HaoChen Gui  ---
(In reply to Segher Boessenkool from comment #3)
> 1001, 0101, 0011 I mean of course.
> 
> In some ways CCmode models this better than CCFPmode, but we do not actually
> model the SO bit (bit 3) at all in CCmode.  It is a nice feature of CCmode
> (that we actually use as fundamental, in the backend code) that CCmode always
> has exactly one of three bits "hot" (and CCFPmode always one of four).  Bit 3
> (SO) in CCmode is treated as not being part of the CC really, but an extra
> thing.  This doesn't work all that well of course.
> 
> So we really need at least three CC modes:
> 
> -- Exactly one of bits 0..3 hot, like CCFPmode;
> -- Exactly one of bits 0..2 hot, bit 3 independently set, like CCmode (and
>    that independent bit 3 modeled nicely as well, unlike what we have), and
>    also like in the BCD insns;
> -- Bit 0 is all-true, bit 2 is all-false, like in the vcmp* insns.
> 
> Do we need some other CC mode as well?  Do we want separately named CC modes
> for the different variants of this (like the integer CC mode vs. the BCD one)?
> We already have a separate CCUNSmode which is exactly like CCmode, as far as
> the hardware cares, but the meaning is different (for CCUNS the LT and GT bits
> are set based on an unsigned integer compare, not a signed one).  There also
> is CCEQmode, which has only bit 2 valid (we use it for constructing one CR bit
> from others, like with cror or crnot).

Thanks for your comment. We also have the VSX scalar test data class insns
(xststdc*), which use bit 0 as sign and bit 2 as matched. I think they can
share a CC mode with the vcmp* insns.

Could you advise whether the RTL code "unordered" is suitable for the BCD
overflow and invalid bit tests? Or do we need to create UNSPECs for the
overflow and invalid bit tests? Thanks.
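
To make the three kinds of CR field from the quoted comment concrete, here is
a minimal sketch in plain C. It is not GCC code; the enum and function names
are made up for illustration, and bit 0 is assumed to be the most significant
bit of the 4-bit field.

```c
#include <stdbool.h>

/* One 4-bit CR field; bit 0 is the most significant bit.
   All names here are hypothetical.  */
enum { CR_BIT0 = 8, CR_BIT1 = 4, CR_BIT2 = 2, CR_BIT3 = 1 };

/* True if exactly one bit of v is set.  */
static bool one_hot (unsigned v)
{
  return v != 0 && (v & (v - 1)) == 0;
}

/* CCFP-like: exactly one of bits 0..3 is hot.  */
static bool valid_fp_like (unsigned cr)
{
  return one_hot (cr & 0xf);
}

/* CC/BCD-like: exactly one of bits 0..2 is hot, bit 3 is independent.  */
static bool valid_bcd_like (unsigned cr)
{
  return one_hot (cr & (CR_BIT0 | CR_BIT1 | CR_BIT2));
}

/* vcmp*-like: only bit 0 (all-true) and bit 2 (all-false) carry meaning,
   and they are never set together.  */
static bool valid_vcmp_like (unsigned cr)
{
  return (cr & (CR_BIT1 | CR_BIT3)) == 0
         && (cr & (CR_BIT0 | CR_BIT2)) != (CR_BIT0 | CR_BIT2);
}
```

A value such as CR_BIT1 | CR_BIT3 (GT plus the independent bit 3) is valid in
the BCD-like model but not in the FP-like one, which is the distinction the
question above is about.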

[Bug target/114732] ge can't be reversed to unlt for bcd compares

2024-04-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114732

--- Comment #1 from HaoChen Gui  ---
A straightforward test case.  It passes when compiled with -O0 and aborts when
compiled with -O2.

//test.c
#include <altivec.h>

#define BCD_POS0  12  // 0xC
#define BCD_NEG   13  // 0xD

void abort (void);

vector unsigned char maxbcd (unsigned int sign)
{
  vector unsigned char result;
  int i;

  for (i = 15; i > 0; i--)
    result[i] = 0x99;

  result[0] = 0x9 << 4 | sign;

  return result;
}


__attribute__ ((noinline))
int foo (vector unsigned char a, vector unsigned char b)
{
  return __builtin_vec_bcdsub_ge (a, b, 0) != 1;
}

int main(void)
{
  vector unsigned char a = maxbcd (BCD_POS0);
  vector unsigned char b = maxbcd (BCD_NEG);

  if (foo(a, b))
    abort ();

  return 0;
}

[Bug target/114732] New: ge can't be reversed to unlt for bcd compares

2024-04-15 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114732

Bug ID: 114732
   Summary: ge can't be reversed to unlt for bcd compares
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test.c
int
foo (vector unsigned char a, vector unsigned char b)
{
  return __builtin_vec_bcdsub_ge (a, b, 0) != 1;
}

//assembly
bcdsub. 2,2,3,0
cror 26,24,27
mfcr 3,2
rlwinm 3,3,27,1
blr


Here ge is reversed to unlt in the combine pass.
Trying 9 -> 10:
    9: r128:SI=%6:CCFP>=0
      REG_DEAD %6:CCFP
   10: r127:SI=r128:SI^0x1
      REG_DEAD r128:SI
Successfully matched this instruction:
(set (reg:SI 127)
    (unlt:SI (reg:CCFP 106 6)
        (const_int 0 [0])))
allowing combination of insns 9 and 10
original costs 12 + 4 = 16
replacement cost 12
deferring deletion of insn with uid = 9.
modifying insn i3    10: r127:SI=unlt(%6:CCFP,0)
      REG_DEAD %6:CCFP
deferring rescan insn with uid = 10.

But this is wrong for BCD. For BCD, ge should be reversed to lt. The unordered
bit (which is actually overflow) doesn't matter. The root cause is that the BCD
operations use CCFP, and CCFP allows ge to be reversed to unlt. So the BCD
operations should use a separate CC mode.
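
The difference can be illustrated with a small model of one CR field in plain
C. This is only a sketch with made-up names, not GCC code. It assumes the BCD
behaviour described in this bug: exactly one of LT/GT/EQ is set, and bit 3
(overflow) is set independently.

```c
#include <stdbool.h>

/* Bits of one CR field, most significant first (names are illustrative).  */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

static bool cr_ge (unsigned cr)   { return (cr & (CR_GT | CR_EQ)) != 0; }
static bool cr_lt (unsigned cr)   { return (cr & CR_LT) != 0; }
/* unlt as in the assembly above: cror of the LT bit and bit 3.  */
static bool cr_unlt (unsigned cr) { return (cr & (CR_LT | CR_SO)) != 0; }

/* Count the BCD-style CR values for which "unlt" differs from "not ge".  */
static int count_unlt_mismatches (void)
{
  static const unsigned base[] = { CR_LT, CR_GT, CR_EQ };
  int mismatches = 0;

  for (int i = 0; i < 3; i++)
    for (unsigned so = 0; so <= CR_SO; so++)
      {
        unsigned cr = base[i] | so;
        if (cr_unlt (cr) != !cr_ge (cr))
          mismatches++;
      }
  return mismatches;
}
```

In this model, unlt disagrees with "not ge" exactly for GT or EQ combined with
overflow, while lt agrees with "not ge" for all six values. The combined
GT-plus-overflow value is presumably what the test case in comment #1
produces, since subtracting the most negative value from the most positive one
overflows with a positive result.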

[Bug target/93802] gcc generates a rlwinm/or pair instead of a single rlwimi (powerpc)

2024-03-05 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93802

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED
 CC||guihaoc at gcc dot gnu.org

--- Comment #2 from HaoChen Gui  ---
It's already fixed by f4a3cea3fb02. Now it generates a single rlwimi.
  rlwimi 3,3,16,0,31-16

[Bug target/113325] New: unnecessary byte swap for memory clear

2024-01-10 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113325

Bug ID: 113325
   Summary: unnecessary byte swap for memory clear
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test case

void* foo (void* s1)
{
  return __builtin_memset (s1, 0, 32);
}

//assembly
vspltisw 0,0
li 10,16
xxpermdi 0,32,32,2
stxvd2x 0,0,3
stxvd2x 0,3,10
blr

The xxpermdi is unnecessary. The problem occurs on P8 LE.

[Bug target/112707] [14 regression] gcc 14 outputs invalid assembly on ppc: Error: unrecognized opcode: `fctid'

2023-12-10 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112707

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #17 from HaoChen Gui  ---
fixed

[Bug target/112707] [14 regression] gcc 14 outputs invalid assembly on ppc: Error: unrecognized opcode: `fctid'

2023-11-27 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112707

--- Comment #9 from HaoChen Gui  ---
(In reply to Segher Boessenkool from comment #8)
> Yeah, it tested for ISA 2.04 before.  That was an attempt at including 476
> probably?
> 
> We really should have a TARGET_FCTID, on for TARGET_POWERPC64 or for cpu 476
> (so NOT user-selectable separately, of course!); not try to use pre-existing
> flags for this, which might work but will forever stay confusing.
> 
> So either a separate OPTION_FCTID for in rs6000-cpus.def, or TARGET_FCTID.
> Either works for me.
> 
> (Background: in ISA 1.xx it was for 64-bit implementations only.  But it
> does not need 64-bit registers or a 64-bit integer pipeline at all, it is an
> FP instruction that works on FP registers, which always are 64-bit.  The
> instruction was implemented on the 476).

Thanks for your explanation.

I found from the assembler source code that "fctid" is supported on PPC64 and
PPC476.
{"fctid",   XRC(63,814,0),  XRA_MASK,    PPC64,  PPCVLE, {FRT, FRB}},
{"fctid",   XRC(63,814,0),  XRA_MASK,    PPC476, PPCVLE, {FRT, FRB}},

But powerpc7450 only enables PPC. That's why the assembler complains.
  { "7450",    PPC_OPCODE_PPC | PPC_OPCODE_7450 | PPC_OPCODE_ALTIVEC, 0 },

My question is: can "fctid" be executed on a 32-bit processor such as
powerpc7450? If it is supported, should the assembler be changed as well
(replacing PPC64 with PPC for fctid)?

[Bug target/112707] [14 regression] gcc 14 outputs invalid assembly on ppc: Error: unrecognized opcode: `fctid'

2023-11-26 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112707

HaoChen Gui  changed:

   What|Removed |Added

 CC||linkw at gcc dot gnu.org

--- Comment #4 from HaoChen Gui  ---
I can't reproduce it in my environment. Could you tell me the assembler
version and the CPU type you used, or the default CPU type? Thanks.

[Bug target/111449] memcmp (p,q,16) == 0 can be optimized better on ppc64 with vector comparison instructions

2023-11-17 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111449

HaoChen Gui  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from HaoChen Gui  ---
Fixed

[Bug rtl-optimization/112417] expand_builtin_return shoud check alignment for the memory reference

2023-11-06 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112417

HaoChen Gui  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |guihaoc at gcc dot gnu.org
 CC||guihaoc at gcc dot gnu.org

--- Comment #1 from HaoChen Gui  ---
Created attachment 56521
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56521&action=edit
proposed patch

[Bug rtl-optimization/112417] New: expand_builtin_return shoud check alignment for the memory reference

2023-11-06 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112417

Bug ID: 112417
   Summary: expand_builtin_return shoud check alignment for the
memory reference
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test.c

void * foo (void * p)
{
  if (p)
    __builtin_return (p);
}


When compiling it with -mno-vsx on ppc64, GCC generates 16-byte aligned vector
load instructions for a memory reference that is only 1-byte aligned.

(insn 28 27 30 4 (set (reg:V4SF 66 2)
        (mem:V4SF (plus:DI (reg/v/f:DI 118 [ p ])
                (reg:DI 120)) [0  S16 A8])) "test4.c":4:5 1676 {*altivec_movv4sf}

This is unsafe because a 16-byte aligned vector load instruction itself does an
"AND -16" on the memory address. I think expand_builtin_return should check the
alignment and call the misaligned_mem_ref expander to load the memory reference.
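
The effect of that implicit "AND -16" can be sketched in plain C. This is only
an address calculation model with hypothetical helper names; it does not touch
memory or use any vector instruction.

```c
#include <stdint.h>

/* Effective address of a 16-byte aligned vector load: the low four address
   bits are ignored, i.e. the address is ANDed with -16.  */
static uintptr_t aligned_load_address (uintptr_t addr)
{
  return addr & ~(uintptr_t) 15;
}

/* Number of bytes by which such a load starts before the requested address.
   A non-zero result means the wrong 16 bytes are loaded.  */
static unsigned bytes_before_request (uintptr_t addr)
{
  return (unsigned) (addr - aligned_load_address (addr));
}
```

For a requested address of 0x1008, the model loads from 0x1000, that is 8
bytes too early, which is why an A8 memory reference must not be loaded this
way.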

[Bug target/88558] Inline lrint, lrintf

2023-10-09 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88558

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |guihaoc at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from HaoChen Gui  ---
fixed.

[Bug target/111449] New: memcmp (p,q,16) == 0 can be optimized better on ppc64 with vector comparison instructions

2023-09-17 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111449

Bug ID: 111449
   Summary: memcmp (p,q,16) == 0 can be optimized better on ppc64
with vector comparison instructions
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

int compare (const char* s1, const char* s2)
{
  return __builtin_memcmp (s1, s2, 16) == 0;
}


trunk outputs
ld 10,0(3)
ld 9,0(4)
cmpd 0,10,9
beq 0,.L6
.L2:
li 3,1
cntlzw 3,3
srwi 3,3,5
blr
.p2align 4,,15
.L6:
ld 10,8(3)
ld 9,8(4)
li 3,0
cmpd 0,10,9
bne 0,.L2
cntlzw 3,3
srwi 3,3,5
blr

It is expected to use a vector comparison to eliminate the branches:
lxv 32,0(3)
lxv 33,0(4)
vcmpequb. 0,0,1
mfcr 3,2
rlwinm 3,3,25,1
blr
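
For comparison, the same branch-free idea can be written in portable C. Note
that this is a scalar substitute using two 64-bit loads per operand, not the
vector sequence proposed above, and whether a given compiler emits it without
branches is not guaranteed. The function name is made up.

```c
#include <stdint.h>
#include <string.h>

/* Branch-free 16-byte equality test: XOR the two halves and OR the
   differences together; the result is zero only if all 16 bytes match.  */
static int equal16 (const char *s1, const char *s2)
{
  uint64_t a0, a1, b0, b1;

  /* memcpy keeps the loads valid for unaligned pointers.  */
  memcpy (&a0, s1, 8);
  memcpy (&a1, s1 + 8, 8);
  memcpy (&b0, s2, 8);
  memcpy (&b1, s2 + 8, 8);

  return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}
```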

[Bug target/96762] ICE in extract_insn, at recog.c:2294 (error: unrecognizable insn)

2023-09-11 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96762

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #10 from HaoChen Gui  ---
Fixed and backported.

[Bug target/108812] gcc.target/powerpc/p9-sign_extend-runnable.c fails on power 9 BE

2023-09-11 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108812

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #5 from HaoChen Gui  ---
Fixed.

[Bug rtl-optimization/110034] The first popped allcono doesn't take precedence over later popped in ira coloring

2023-08-30 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110034

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #6 from HaoChen Gui  ---
It's not a problem.

[Bug target/108728] gcc.dg/torture/float128-cmp-invalid.c fails on power 9 BE

2023-08-29 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108728

HaoChen Gui  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from HaoChen Gui  ---
Fixed by xfailing the test case.

[Bug rtl-optimization/110034] The first popped allcono doesn't take precedence over later popped in ira coloring

2023-08-29 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110034

--- Comment #5 from HaoChen Gui  ---
(In reply to Vladimir Makarov from comment #4)
> Thank you for providing the test case.
> 
> To be honest I don't see why assigning to hr3 to r134 is better.
> Currently we have the following assignments:
> 
> hr9->r134; hr3->r173; hr3->r124
> 
> and the related preferences:
> 
>   cp11:a18(r134)<->a29(r173)@125:shuffle
>   pref3:a29(r173)<-hr3@2000
>   pref4:a0(r124)<-hr3@125
> 
> This removes cost 2000 (pref3) and cost 125 (pref4) and adds cost 125
> (cp11).  The profit is 2000
> 
> If we started with r173, we would have the following assignments:
> 
> hr3->r173; hr3->r134; ->r124
> 
> This would remove cost 2000 (pref3) and cost 125 (cp11) and add cost
> 125 (pref).  The profit would be the same 2000.
> 
> Choice of heuristics is very time consuming.  I spent a lot of time to
> try and benchmark numerous ones.  I clearly remember that introduction
> of pseudo threads for colorable bucket gave visible performance
> improvement.  Currently we assign pseudos from a thread with the
> biggest frequency first (r173 and r134) and a pseudo (r134) with the
> biggest frequency first from the same thread.  I think it is logical.
> 
> Also it is always possible to find a test (not this case) where
> heuristics give some undesirable results.  RA is NP-complete task even
> in the simplest formulation. We can not get the optimal solution for
> reasonable time.
> 
> Still I am open to change any heuristic if somebody can show that it
> improves performance for some credible benchmark (I prefer SPEC2007)
> on major GCC targets.

Thanks for your explanation. I agree with it. I also checked the assembly and
found there is no potential performance gain when r3 is assigned to r134. So
it should not be a bug.

[Bug target/106769] PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz

2023-08-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

HaoChen Gui  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from HaoChen Gui  ---
fixed

[Bug target/110429] Redundant vector extract instruction on P9

2023-08-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110429

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from HaoChen Gui  ---
Fixed

[Bug target/103605] [PowerPC] fmin/fmax should be inlined always with xsmindp/xsmaxdp

2023-08-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103605

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #10 from HaoChen Gui  ---
Fixed.

[Bug target/104124] Poor optimization for vector splat DW with small consts

2023-07-13 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #6 from HaoChen Gui  ---
fixed

[Bug rtl-optimization/107013] Add fmin/fmax to RTL codes

2023-07-10 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107013

HaoChen Gui  changed:

   What|Removed |Added

 CC||joseph at codesourcery dot com

--- Comment #1 from HaoChen Gui  ---
If fmin/max are added as new RTL codes, the fmin/max unspec in some targets can
be replaced with RTL codes. Do you think it is necessary? If so, I can draft
one. Looking forward to your advice. Thanks.

[Bug target/110331] ppc64 vec_extract with constant index is suboptimal on P8

2023-06-29 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110331

--- Comment #1 from HaoChen Gui  ---
Even the P9 assembly is not good, as vextu* has a higher latency than mfvsrd.
li 9,12
vextubrx 3,9,2

[Bug target/110429] New: Redundant vector extract instruction on P9

2023-06-27 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110429

Bug ID: 110429
   Summary: Redundant vector extract instruction on P9
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test.c
#include <altivec.h>
void extract_int_2 (int *p, vector int a) { *p = vec_extract (a, 2); }

On P9 LE, it generates
xxextractuw 34,34,4
stxsiwx 34,0,3

The xxextractuw is unnecessary as the extracted int is just at word[1].

[Bug target/110331] New: ppc64 vec_extract with constant index is suboptimal on P8

2023-06-20 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110331

Bug ID: 110331
   Summary: ppc64 vec_extract with constant index is suboptimal on
P8
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test.c
#include <altivec.h>

#ifdef __BIG_ENDIAN__
#define LANE_B 3
#else
#define LANE_B 12
#endif

unsigned char foo1 (vector unsigned char v)
{
  return vec_extract (v, LANE_B);
}

Trunk generates:
vspltb 2,2,3
mfvsrd 3,34
rlwinm 3,3,0,0xff

While it can be optimized as:
mfvsrd 3,34
rldicl 3,3,32,56
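
What the proposed rldicl does to the doubleword moved by mfvsrd can be modeled
in plain C. This is a sketch of the instruction's rotate-and-mask semantics
only; which vector lane ends up in which byte of the doubleword depends on
endianness and is not modeled here.

```c
#include <stdint.h>

/* Model of rldicl: rotate left by sh, then clear the mb leftmost bits.
   Requires sh <= 63 and mb <= 63.  */
static uint64_t rldicl (uint64_t rs, unsigned sh, unsigned mb)
{
  uint64_t rot = sh ? (rs << sh) | (rs >> (64 - sh)) : rs;
  return rot & (UINT64_MAX >> mb);
}
```

With sh = 32 and mb = 56 this is the same as (rs >> 32) & 0xff, so one
instruction both selects the byte and zero-extends it.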

[Bug target/106769] PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz

2023-06-06 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

--- Comment #3 from HaoChen Gui  ---
(In reply to Peter Bergner from comment #2)
> I wonder if Ajit's REE changes catch this unneeded zero extension?

mfvsrwz can be defined as a zero-extend of a vector select rather than as an SI
mode move from "wa" to "r". Then the combine pass can help us eliminate the
redundant zero-extend. I will submit a patch for it.

[Bug target/106769] PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz

2023-05-31 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

HaoChen Gui  changed:

   What|Removed |Added

   Last reconfirmed||2023-05-31
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1

[Bug rtl-optimization/110034] The first popped allcono doesn't take precedence over later popped in ira coloring

2023-05-30 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110034

--- Comment #3 from HaoChen Gui  ---
Created attachment 55215
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55215&action=edit
ira dump

[Bug rtl-optimization/110034] The first popped allcono doesn't take precedence over later popped in ira coloring

2023-05-30 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110034

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #2 from HaoChen Gui  ---
Created attachment 55214
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55214&action=edit
test case

Compile it with -O3 -mcpu=power9 -fira-verbose=20 > ira_dump.out 2>&1

[Bug target/106769] PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz

2023-05-30 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

HaoChen Gui  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |guihaoc at gcc dot gnu.org
 CC||guihaoc at gcc dot gnu.org

--- Comment #1 from HaoChen Gui  ---
The problem only occurs on processors below P9. It can be reproduced with
-mcpu=power8.

[Bug rtl-optimization/110034] New: The first popped allcono doesn't take precedence over later popped in ira coloring

2023-05-30 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110034

Bug ID: 110034
   Summary: The first popped allcono doesn't take precedence over
later popped in ira coloring
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

The following are IRA dumps from a test case. r134 has only one cp (shuffle)
with r173. r173 and r124 both have hard register r3 as their preferred
register. r134 is popped first, but it fails to get hard register r3 because
the conflict cost is high. If r173 were ahead of r134, r134 could get hard
register r3, as there is no hop between r3 and r134 after r173 is assigned r3.
It seems the first popped allocno (r134) doesn't take precedence over a later
popped allocno (r124).

r173 has a preferred hard register and has no conflicting allocno. So r173 can
always be assigned r3.

;; a29(r173,l0) conflicts:
;; total conflict hard regs:
;; conflict hard regs:

...

  cp11:a18(r134)<->a29(r173)@125:shuffle
  pref0:a12(r158)<-hr100@711
  pref1:a23(r144)<-hr100@920
  pref2:a25(r140)<-hr100@1842
  pref3:a29(r173)<-hr3@2000
  pref4:a0(r124)<-hr3@125

...

Start updating from pref of hr3 for a29r173:
  a18r134 (hr3): update cost by -62, conflict cost by -62

...

  Pushing a1(r169,l0)(cost 0)
  Pushing a0(r124,l0)(cost 0)
  Pushing a22(r146,l0)(cost 0)
  Pushing a20(r125,l0)(cost 0)
  Pushing a29(r173,l0)(cost 0)
  Pushing a18(r134,l0)(cost 0)
  Popping a18(r134,l0)  -- (9=0,0) (10=0,0) (8=0,0) (7=0,0) (6=0,0) (5=0,0)
(4=0,0) (3=-62,147) (11=0,0) (0=8000,8000) (31=7,7) (30=7,7) (29=7,7) (28=7,7)
(27=7,7) (26=7,7) (25=7,7) (24=7,7) (23=7,7) (22=7,7) (21=7,7) (20=7,7)
(19=7,7) (18=7,7) (17=7,7) (16=7,7) (15=7,7) (14=7,7) (12=0,0)
Start restoring from a18r134:
  a29r173 (hr3): update cost by -62, conflict cost by -62
Start updating from a18r134 by copies:
  a29r173 (hr9): update cost by -250, conflict cost by -250
assign reg 9
  Popping a29(r173,l0)  -- (9=1750,1750) (10=2000,2000) (8=2000,2000)
(7=2000,2000) (6=2000,2000) (5=2000,2000) (4=2000,2000) (3=-2062,-2062)
(11=2000,2000) (0=2000,2000) (31=2007,2007) (30=2007,2007) (29=2007,2007)
(28=2007,2007) (27=2007,2007) (26=2007,2007) (25=2007,2007) (24=2007,2007)
(23=2007,2007) (22=2007,2007) (21=2007,2007) (20=2007,2007) (19=2007,2007)
(18=2007,2007) (17=2007,2007) (16=2007,2007) (15=2007,2007) (14=2007,2007)
(12=2000,2000)
Start restoring from a29r173:
Start updating from a29r173 by copies:
assign reg 3

...

  Popping a0(r124,l0)  -- (8=0,0) (7=0,0) (6=0,0) (5=0,0) (4=0,0)
(3=-250,-250) (11=0,0) (0=0,0) (31=7,7) (30=7,7) (29=7,7) (28=7,7) (27=7,7)
(26=7,7) (25=7,7) (24=7,7) (23=7,7) (22=7,7) (21=7,7) (20=7,7) (19=7,7)
(18=7,7) (17=7,7) (16=7,7) (15=7,7) (14=7,7) (12=0,0)
Start restoring from a0r124:
Start updating from a0r124 by copies:
  a1r169 (hr3): update cost by -44, conflict cost by -44
  a2r177 (hr3): update cost by -11, conflict cost by -11
assign reg 3

[Bug target/54063] [10/11/12/13/14 regression] on powerpc64 gcc 4.9/8 generates larger code for global variable accesses than gcc 4.7

2023-04-20 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54063

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #26 from HaoChen Gui  ---
I experimented with moving the split of "tocref" before reload (doing it at
split1). The additional addis can then be optimized out by postreload cse on
P9. I also tested SPEC 2017, and it does not seem to hit the problems Alan
pointed out. But there are several other issues.
1. The optimization relies on the sequence of insns. On P8, the memory load
insn is moved ahead of the second addis by the sched pass, so postreload cse
can't optimize it because r9 is used by the load.
2. The patch causes different register assignment. Comparing the object files
in SPEC shows that the register assignment changes and tends to use fewer
registers with the patch.
3. The patch has a side effect on BB head merging in the jump2 pass. The sched
pass commonly separates the two tocref insns if they're already split, so the
sequence of insns in the two branch arms might change. Sometimes BB head
merging can be done with the patch but not without it, and sometimes it is the
other way round. Both positive and negative examples can be found in the
object files.

[Bug target/103628] ICE: Segmentation fault (in gfc_conv_tree_to_mpfr)

2023-02-22 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103628

HaoChen Gui  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |guihaoc at gcc dot gnu.org
 CC||guihaoc at gcc dot gnu.org

--- Comment #5 from HaoChen Gui  ---
The memory representation of IBM long double is not unique. It's actually the
sum of two 64-bit doubles. 

During decoding, the real variable b can be
b = {cl = 1, decimal = 0, sign = 0, signalling = 0, canonical = 0,
  uexp = 67108357, sig = {0, 0, 9295712554570040320}}
which is the sum of the following two doubles
u = {cl = 1, decimal = 0, sign = 0, signalling = 0, canonical = 0,
  uexp = 67108356, sig = {0, 0, 9295712899447228416}}
v = {cl = 1, decimal = 0, sign = 0, signalling = 0, canonical = 0,
  uexp = 67108356, sig = {0, 0, 9295712209692852224}}

During encoding, the real variable b can be
b = {cl = 1, decimal = 0, sign = 0, signalling = 0, canonical = 0,
  uexp = 67108357, sig = {0, 0, 9295712554570040320}}
which is split into the following two doubles
u = {cl = 1, decimal = 0, sign = 0, signalling = 0, canonical = 0,
  uexp = 67108357, sig = {0, 0, 9295712554570039296}}
v = {cl = 1, decimal = 0, sign = 0, signalling = 0, canonical = 0,
  uexp = 67108304, sig = {0, 0, 9223372036854775808}}

After decoding and encoding, the memory representation changes. Since PR95450
added a check that verifies the decoding/encoding round trip,
native_interpret_expr returns a NULL tree for this case, which causes the ICE.

Shall we disable Hollerith constants for IBM long double
(-mabi=ibmlongdouble)? Or just pass it to the upper layer and let the parser
report an error? Please advise.
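
The non-uniqueness can be shown with plain doubles. This is only a sketch: the
struct and function names are hypothetical, the numbers are illustrative and
not taken from the dump above, and the normalization used is the standard
Fast2Sum step rather than GCC's encoding routine.

```c
/* A double-double value is the exact sum hi + lo of two doubles.  */
struct dd { double hi, lo; };

/* Renormalize a pair so that hi is the rounded sum and lo the remainder
   (Fast2Sum; assumes |u| >= |v|).  */
static struct dd dd_normalize (double u, double v)
{
  struct dd r;

  r.hi = u + v;
  r.lo = v - (r.hi - u);
  return r;
}
```

For example, the pair (1.5, 1.25) normalizes to (2.75, 0.0). Both pairs
represent the same value, but their memory images differ, which is the kind of
change a decode/encode round trip can introduce.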

[Bug target/100952] [12/13 regression] several test case failures after r12-1202

2022-12-19 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100952

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #19 from HaoChen Gui  ---
Closing as fixed then (pr56605.c still fails on older branches, but that is
harmless). Other issues are all fixed.

[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9

2022-12-13 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED
   Assignee|unassigned at gcc dot gnu.org  |guihaoc at gcc dot gnu.org
 CC||guihaoc at gcc dot gnu.org

--- Comment #17 from HaoChen Gui  ---
Both issues are fixed.

[Bug target/108004] x-form logical operations with dot instructions are not emitted.

2022-12-07 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108004

--- Comment #4 from HaoChen Gui  ---
$cat asm_test.c
#include <stdio.h>

unsigned long foo() {
  unsigned long res;
  __asm__ ("li 3,0x\n\t"
   "li 4,0xfff1\n\t"
   "and. 3,3,4\n\t"
   "mfcr %0"
   : "=r" (res));
  return res;
}

int
main (void)
{
  printf ("%lx\n", foo());
  return 0;
}
$ gcc -O1 -o asm_test asm_test.c && ./asm_test
82000482

The assembly above is used to test "and.". Bit 32 (the cr0 LT bit) is set when
the result is less than 0.

[Bug target/108004] x-form logical operations with dot instructions are not emitted.

2022-12-06 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108004

--- Comment #3 from HaoChen Gui  ---
(In reply to Andrew Pinski from comment #2)
> Especially when it comes to signed comparisons.

From the ISA:
For all fixed-point instructions in which Rc=1, and for
addic., andi., and andis., the first three bits of CR Field
0 (bits 32:34 of the Condition Register) are set by
signed comparison of the result to zero, and the fourth
bit of CR Field 0 (bit 35 of the Condition Register) is
copied from the SO field of the XER.
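
The quoted rule can be written down as a small C model. It is a sketch with
made-up names that assumes 64-bit mode, so the whole 64-bit result takes part
in the signed comparison.

```c
#include <stdint.h>

/* Bits of CR field 0, most significant first (names are illustrative).  */
enum { CR0_LT = 8, CR0_GT = 4, CR0_EQ = 2, CR0_SO = 1 };

/* CR field 0 after an Rc=1 fixed-point instruction: a signed comparison of
   the result with zero, plus a copy of XER[SO] in the fourth bit.  */
static unsigned cr0_after_record (int64_t result, int xer_so)
{
  unsigned cr0 = result < 0 ? CR0_LT : result > 0 ? CR0_GT : CR0_EQ;

  return cr0 | (xer_so ? CR0_SO : 0);
}
```

A negative result therefore gives a CR0 value of 8 when XER[SO] is clear, and
only the LT, GT and EQ bits depend on the result itself.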

[Bug target/108004] New: x-form logical operations with dot instructions are not emitted.

2022-12-06 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108004

Bug ID: 108004
   Summary: x-form logical operations with dot instructions are
not emitted.
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test case
int foo (int a, int b, int c, int d)
{
  return (a & b) > 0 ? c : d;
}

//assembly on P9
and 3,3,4
cmpwi 0,3,0
isel 5,5,6,1
extsw 3,5

The "and" and "cmpwi" can be optimized to a single "and." instruction. The same
applies to "or" and "xor".

[Bug target/103109] madd not used for multiply add on POWER9

2022-10-10 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103109

--- Comment #4 from HaoChen Gui  ---
(In reply to Peter Bergner from comment #3)
> (In reply to HaoChen Gui from comment #2)
> > Fixed by r13-2107.
> 
> This is marked version = GCC 12.  Were you planning on backporting this?


Not sure if the patch needs to be backported. It's not a functional issue.

[Bug rtl-optimization/107013] New: Add fmin/fmax to RTL codes

2022-09-22 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107013

Bug ID: 107013
   Summary: Add fmin/fmax to RTL codes
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

Could we add fmin/fmax to RTL codes so that the C standard fmin/fmax can be
represented in RTL without UNSPECs? Currently we only have smin/smax that are
not valid for NaNs, or when the sign of zeros is relevant.

C standard for fmax
F.10.9.2 The fmax functions
1 If just one argument is a NaN, the fmax functions return the other argument
(if both arguments are NaNs, the functions return a NaN).

2 The returned value is exact and is independent of the current rounding
direction mode.

3 The body of the fmax function might be 374)

{ return (isgreaterequal(x, y) ||
 isnan(y)) ? x : y; }

Footnotes

374) Ideally, fmax would be sensitive to the sign of zero, for example
fmax(-0.0, +0.0) would return +0; however, implementation in software might be
impractical.
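
The NaN behaviour described above can be tried out with a small C sketch. The
function names are made up. The second function is a plain comparison-based
maximum shown only for contrast; it is not meant as a definition of the RTL
smax code.

```c
#include <math.h>

/* fmax body as suggested in the quoted standard text.  */
static double fmax_like (double x, double y)
{
  return (isgreaterequal (x, y) || isnan (y)) ? x : y;
}

/* A plain "greater-than" maximum, which gives no guarantee for NaNs.  */
static double plain_max (double x, double y)
{
  return x > y ? x : y;
}
```

fmax_like returns the non-NaN argument regardless of operand order, while
plain_max (1.0, NAN) returns NaN because the comparison with NaN is false.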

[Bug middle-end/102316] Unexpected stringop-overflow Warnings on POWER CPU

2022-08-25 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102316

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org

--- Comment #4 from HaoChen Gui  ---
#define LEN 4

struct {
  char c[LEN];
} d;

extern int a;
extern char* b;

int p() {
  for (int i = 0; i < a; i++) {
    d.c[i] = b[i];
  }
  return 0;
}

The code above causes the same errors on x86. When LEN is set to 8, it can also
be reproduced on aarch64. It's a common problem.

The iteration count of the remainder loop after vectorization should be decided
not only by variable "a" but also by the length of the array. If the length is
5 and the vector size is 4, the remainder loop should be executed only once.
Currently the iteration count depends only on variable "a", so the loop is
completely unrolled 3 times if the vector size is 4. That causes the warning.

   [local count: 398179264]:
  # i_30 = PHI 
  _32 = (sizetype) i_30;
  _33 = b.0_1 + _32;
  _34 = *_33;
  d.c[i_30] = _34;
  i_36 = i_30 + 1;
   if (i_36 < a.1_13)  // iterations depend on "a" only; the length of the array
is not taken into consideration
goto ; [89.00%]
  else
goto ; [11.00%]

[Bug target/103109] madd not used for multiply add on POWER9

2022-08-18 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103109

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 CC||guihaoc at gcc dot gnu.org
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from HaoChen Gui  ---
Fixed by r13-2107.

[Bug target/102146] [11 regression] several test cases fails after r11-8940

2022-08-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102146

--- Comment #20 from HaoChen Gui  ---
(In reply to Segher Boessenkool from comment #19)
> Hi guys,
> 
> What testcases are still failing?  I'm a bit lost :-)

pr56605.c is still not fixed.

+FAIL: gcc.target/powerpc/pr56605.c scan-rtl-dump-times combine
"(compare:CC ((?:and|zero_extend):(?:DI) ((?:sub)?reg:[SD]I" 1

[Bug target/103498] Spec 2017 imagick_r is 2.62% slower on Power10 with pc-relative addressing compared to not using pc-relative addressing

2022-08-09 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103498

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #1 from HaoChen Gui  ---
I tested imagick_r on Power10 DD2. Performance is the same at -Ofast with and
without pcrel. I'm not sure whether DD2 fixed the regression.

[Bug target/95737] PPC: Unnecessary extsw after negative less than

2022-07-28 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95737

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #11 from HaoChen Gui  ---
Fixed.

[Bug target/100694] PPC: initialization of __int128 is very inefficient

2022-07-28 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

--- Comment #6 from HaoChen Gui  ---
I made a patch to convert ashift to a move when the second operand is
const0_rtx. With the patch, the expand dump looks just like aarch64's, but the
problem is still there.
I tested the patch with SPECint. All the object files are the same as the base,
so it seems the shift is always optimized away in later passes.

[Bug target/100996] rs6000 p10 vector add-add fusion should work with -m32 but doesn't

2022-07-28 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100996

--- Comment #2 from HaoChen Gui  ---
(In reply to acsawdey from comment #0)
> The fusion-p10-addadd.c test case does not get vector add-add fusion when
> compiling with -m32:
> 
> /home/sawdey/work/gcc/trunk/build/gcc/xgcc
> -B/home/sawdey/work/gcc/trunk/build/gcc/
> /home/sawdey/work/gcc/trunk/gcc/gcc/testsuite/gcc.target/powerpc/fusion-p10-
> addadd.c  -m32  -fdiagnostics-plain-output  -mcpu=power10 -O3 -dap
> -fno-ident -S
> 
> typedef vector long vlong;
> vlong vaddadd(vlong a, vlong b, vlong c)
> {
>   return a+b+c;
> }
> 
> vaddadd:
> .LFB3:
> .cfi_startproc
> vadduwm 2,2,3# 8[c=4 l=4]  addv4si3
> vadduwm 2,2,4# 14   [c=4 l=4]  addv4si3
> blr  # 24   [c=4 l=4]  simple_return
> .cfi_endproc

vadduwm is not part of the P10 fusion sequences; only vaddudm is.

[Bug target/100996] rs6000 p10 vector add-add fusion should work with -m32 but doesn't

2022-07-28 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100996

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|ASSIGNED|RESOLVED
 CC||guihaoc at gcc dot gnu.org

--- Comment #1 from HaoChen Gui  ---
(In reply to acsawdey from comment #0)
> The fusion-p10-addadd.c test case does not get vector add-add fusion when
> compiling with -m32:
> 
> /home/sawdey/work/gcc/trunk/build/gcc/xgcc
> -B/home/sawdey/work/gcc/trunk/build/gcc/
> /home/sawdey/work/gcc/trunk/gcc/gcc/testsuite/gcc.target/powerpc/fusion-p10-
> addadd.c  -m32  -fdiagnostics-plain-output  -mcpu=power10 -O3 -dap
> -fno-ident -S
> 
> typedef vector long vlong;
> vlong vaddadd(vlong a, vlong b, vlong c)
> {
>   return a+b+c;
> }
> 
> vaddadd:
> .LFB3:
> .cfi_startproc
> vadduwm 2,2,3# 8[c=4 l=4]  addv4si3
> vadduwm 2,2,4# 14   [c=4 l=4]  addv4si3
> blr  # 24   [c=4 l=4]  simple_return
> .cfi_endproc

vadduwm is not part of the P10 fusion sequences; only vaddudm is.

[Bug target/100694] PPC: initialization of __int128 is very inefficient

2022-07-25 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #5 from HaoChen Gui  ---
(In reply to Segher Boessenkool from comment #4)
> On aarch64 we have (in expand):
> 
> ;; i_4 = i_3 << 64;
> 
> (insn 10 9 11 (set (subreg:DI (reg/v:TI 94 [ i ]) 8)
> (subreg:DI (reg/v:TI 93 [ i ]) 0)) "100694.c":4:6 -1
>  (nil))
> 
> (insn 11 10 0 (set (subreg:DI (reg/v:TI 94 [ i ]) 0)
> (const_int 0 [0])) "100694.c":4:6 -1
>  (nil))
> 
> But on rs6000 we get:
> 
> ;; i_4 = i_3 << 64;
> 
> (insn 10 9 11 (set (subreg:DI (reg/v:TI 119 [ i ]) 0)
> (ashift:DI (subreg:DI (reg/v:TI 118 [ i ]) 8)
> (const_int 0 [0]))) "100694.c":4:6 -1
>  (nil))
> 
> (insn 11 10 0 (set (subreg:DI (reg/v:TI 119 [ i ]) 8)
> (const_int 0 [0])) "100694.c":4:6 -1
>  (nil))
> 
> What the what.

On rs6000, insn 10 is optimized in the forward propagation pass.
test.c.261r.fwprop1:
(insn 10 5 11 2 (set (subreg:DI (reg/v:TI 119 [ i ]) 8)
(reg/v:DI 122 [ hi ])) "test.c":4:6 670 {*movdi_internal64}
 (expr_list:REG_DEAD (reg:DI 126 [ i ])

It seems aarch64 optimizes it in the expand pass.

Now the problem is that the "ior" operation is done in TImode on rs6000, while
it is done on two subreg:DI on aarch64. The subreg pass can decompose a
register that is only ever used through subregs. If the ior were done on two
subreg:DI on rs6000, it could be optimized by the subreg pass.

on rs6000:
(insn 14 13 15 2 (set (reg:TI 125 [ i ])
(ior:TI (reg:TI 124 [ lo ])
(reg/v:TI 119 [ i ]))) "test.c":5:6 494 {*boolti3_internal}

on aarch64
(insn 21 20 22 2 (set (reg:DI 100)
(ior:DI (subreg:DI (reg:TI 99) 0)
(subreg:DI (reg/v:TI 94 [ i ]) 0))) "/app/example.c":5:6 521
{iordi3}
(insn 23 22 24 2 (set (reg:DI 101)
(ior:DI (subreg:DI (reg:TI 99) 8)
(subreg:DI (reg/v:TI 94 [ i ]) 8))) "/app/example.c":5:6 521
{iordi3}
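
To make the difference concrete, here is a small C sketch (an illustration
only, not taken from the test case) that models a TImode value as two 64-bit
words. The type ti_pair and the function names are hypothetical, and
combine_wide assumes a compiler that provides unsigned __int128, as GCC does
on 64-bit targets.

```c
#include <stdint.h>

/* Hypothetical two-word model of a TImode value.  */
typedef struct { uint64_t lo, hi; } ti_pair;

/* i << 64: the low word moves to the high word, the low word is zero.  */
static ti_pair
shl64 (ti_pair x)
{
  ti_pair r = { 0, x.lo };
  return r;
}

/* The ior done one 64-bit half at a time, as in the aarch64 dump.  */
static ti_pair
ior_halves (ti_pair a, ti_pair b)
{
  ti_pair r = { a.lo | b.lo, a.hi | b.hi };
  return r;
}

/* The same computation on a native 128-bit integer, for comparison.  */
static unsigned __int128
combine_wide (uint64_t hi, uint64_t lo)
{
  unsigned __int128 i = hi;
  i <<= 64;
  i |= lo;
  return i;
}
```

Both forms produce the same two words; the half-wise form never needs a
128-bit operation, which is what lets a register be split into halves.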

[Bug target/103316] PowerPC: Gimple folding of int128 comparisons produces suboptimal code

2022-06-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103316

HaoChen Gui  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
   Assignee|unassigned at gcc dot gnu.org  |guihaoc at gcc dot gnu.org
 Resolution|--- |FIXED

--- Comment #18 from HaoChen Gui  ---
Fixed by r13-1131

[Bug target/102146] [11 regression] several test cases fails after r11-8940

2022-05-19 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102146

--- Comment #15 from HaoChen Gui  ---
As r12-8128 was reverted, the failure of pr56605.c is still not fixed.

[Bug tree-optimization/105414] constant folding for fmin/max(snan, snan) is wrong

2022-05-11 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105414

HaoChen Gui  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #12 from HaoChen Gui  ---
Fixed.

[Bug tree-optimization/105414] constant folding for fmin/max(snan, snan) is wrong

2022-04-29 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105414

--- Comment #8 from HaoChen Gui  ---
(In reply to Jakub Jelinek from comment #7)
> Sure, but you don't want to do that at least if flag_trapping_math.
> Otherwise, the predicate would be tree_expr_signaling_nan_p and real_nan
> function with "", 1 as the middle 2 arguments can create it.  But note that
> nothing in match.pd does that right now, so I don't think we should do it in
> this case either.

If either argument is an sNaN, fmin/fmax should return a qNaN, so I really want
to create a pattern in match.pd to do this. It needs to create a qNaN and
return it. I can't find an existing example in match.pd. It seems the return
value should be related to the arguments?

[Bug tree-optimization/105414] constant folding for fmin/max(snan, snan) is wrong

2022-04-28 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105414

--- Comment #6 from HaoChen Gui  ---
(In reply to Richard Biener from comment #4)
> I think you want
> 
>  (if (!tree_expr_maybe_signaling_nan_p (@0))
> ...
> 
> instead.

Thanks so much for the comments. Is there a way to return a NaN directly in
match.pd when both arguments are sNaNs?

[Bug tree-optimization/105414] constant folding for fmin/max(snan, snan) is wrong

2022-04-27 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105414

--- Comment #3 from HaoChen Gui  ---
For the fmin/fmax behavior, I referred to this ticket:
https://sourceware.org/bugzilla/show_bug.cgi?id=20947

[Bug tree-optimization/105414] constant folding for fmin/max(snan, snan) is wrong

2022-04-27 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105414

--- Comment #2 from HaoChen Gui  ---
(In reply to Andrew Pinski from comment #1)
> What target is this on?

I tested it on ppc64le. But I think it should be on all targets?

[Bug tree-optimization/105414] New: constant folding for fmin/max(snan, snan) is wrong

2022-04-27 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105414

Bug ID: 105414
   Summary: constant folding for fmin/max(snan, snan) is wrong
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

gcc -O0 -fsignaling-nans -D_WANT_SNAN -lm -o fmin fmin.c && ./fmin
(snan, snan), fmin: nan
gcc -O3 -fsignaling-nans -D_WANT_SNAN -lm -o fmin fmin.c && ./fmin
(snan, snan), fmin: snan

fmin(SNaN, SNaN) gives different results at -O0 and -O3. The result should be a
quiet NaN (QNaN) according to the C standard.

The problem might be in match.pd: fmin(a, a) can't be folded to a when a is an
SNaN. I propose the following patch to fix it.

diff --git a/gcc/match.pd b/gcc/match.pd
index cad61848daa..2c2efda158b 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3093,7 +3093,8 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (for minmax (min max FMIN_ALL FMAX_ALL)
  (simplify
   (minmax @0 @0)
-  @0))
+  (if(!HONOR_SNANS (@0) || !TREE_REAL_CST (@0).signalling)
+  @0)))
 /* min(max(x,y),y) -> y.  */
 (simplify
  (min:c (max:c @0 @1) @1)
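
To show why fmin(a, a) -> a is not an identity for an SNaN, here is a
bit-level sketch. It assumes IEEE 754 binary64 with the most significant
fraction bit (bit 51) acting as the quiet bit, which does not hold on every
target; all names are hypothetical and nothing here is GCC code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: binary64 layout where bit 51 is the quiet bit.  */
#define EXP_MASK  UINT64_C (0x7ff0000000000000)
#define FRAC_MASK UINT64_C (0x000fffffffffffff)
#define QUIET_BIT UINT64_C (0x0008000000000000)

static bool
bits_is_nan (uint64_t b)
{
  return (b & EXP_MASK) == EXP_MASK && (b & FRAC_MASK) != 0;
}

static bool
bits_is_snan (uint64_t b)
{
  return bits_is_nan (b) && (b & QUIET_BIT) == 0;
}

/* Bit pattern expected from fmin (a, a): unchanged unless a is an
   SNaN, in which case the NaN is quieted.  */
static uint64_t
fmin_same_arg_bits (uint64_t a)
{
  return bits_is_snan (a) ? (a | QUIET_BIT) : a;
}
```

For an SNaN input the expected result differs from the input in the quiet bit,
so returning @0 unchanged is only valid when @0 cannot be a signaling NaN.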

[Bug target/103605] [PowerPC] fmin/fmax should be inlined always with xsmindp/xsmaxdp

2022-04-26 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103605

--- Comment #6 from HaoChen Gui  ---
gcc -O0 -fsignaling-nans -D_WANT_SNAN -lm   -o main main.c && ./main
(nan, 3.0), fmin: 3.0, builtin: 3.0, xsmincdp: 3.0, xsmindp: 3.0
(3.0, nan), fmin: 3.0, builtin: nan, xsmincdp: nan, xsmindp: 3.0
(snan, 3.0), fmin: nan, builtin: 3.0, xsmincdp: 3.0, xsmindp: nan
(3.0, snan), fmin: nan, builtin: snan, xsmincdp: snan, xsmindp: nan
(snan, snan), fmin: nan, builtin: snan, xsmincdp: snan, xsmindp: nan

For 'fmin', the result is a qNaN if either argument is an sNaN, and the result
of 'xsmindp' matches 'fmin'.
I will make a patch to implement fmin_optab with xsmindp and fmax_optab with
xsmaxdp.
Thanks.

[Bug target/103605] [PowerPC] fmin/fmax should be inlined always with xsmindp/xsmaxdp

2022-04-26 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103605

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #2 from HaoChen Gui  ---
I think neither xsmindp nor xsmincdp is consistent with the C99/C11 standard.
So without fast-math, fmin can't be implemented with xsmindp/xsmincdp.

C99/11 standard
If just one argument is a NaN, the fmin functions return the other argument (if
both arguments are NaNs, the functions return a NaN).
fmin(NaN, 3.0) = fmin(3.0, NaN) = 3.0

xsmindp
The minimum of a QNaN and any value is that value. The minimum of any value and
an SNaN is that SNaN converted to a QNaN.
xsmindp(NaN, 3.0) = 3.0 xsmindp(3.0, NaN) = NaN

xsmincdp
If either src1 or src2 is a NaN, result is src2.
Otherwise, if src1 is less than src2, result is src1.
Otherwise, result is src2.
xsmincdp(NaN, 3.0) = 3.0 xsmincdp(3.0, NaN) = NaN
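
The xsmincdp rule and the C99/C11 rule as quoted above can be compared with a
small C model. This is only a sketch of the text in this comment, not of the
ISA document itself, and the function names are hypothetical.

```c
#include <math.h>

/* Model of the xsmincdp rule as quoted above: any NaN input selects
   src2; otherwise the smaller value is returned.  */
static double
model_xsmincdp (double src1, double src2)
{
  if (isnan (src1) || isnan (src2))
    return src2;
  return src1 < src2 ? src1 : src2;
}

/* Model of the C99/C11 rule as quoted above: a single NaN yields the
   other argument.  */
static double
model_c_fmin (double x, double y)
{
  if (isnan (x))
    return y;
  if (isnan (y))
    return x;
  return x < y ? x : y;
}
```

The two models agree on (NaN, 3.0) but differ on (3.0, NaN), where the
xsmincdp model returns the NaN and the C rule returns 3.0.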

[Bug target/102146] [11 regression] several test cases fails after r11-8940

2022-04-13 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102146

--- Comment #10 from HaoChen Gui  ---
(In reply to HaoChen Gui from comment #9)
> Could you backport the patch to GCC11? Thanks.

Please ignore it, as the patch has a problem. Thanks.

[Bug target/102146] [11 regression] several test cases fails after r11-8940

2022-04-13 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102146

--- Comment #9 from HaoChen Gui  ---
Could you backport the patch to GCC11? Thanks.

[Bug tree-optimization/105030] store motion if-change flag causes if-conversion optimization can't be taken.

2022-04-08 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

--- Comment #13 from HaoChen Gui  ---
Could we use the original alias set if the tree code of 'atemp' is VAR_DECL? Is
that safe? In which situations must we use alias-set zero? Thanks.

[Bug tree-optimization/105030] store motion if-change flag causes if-conversion optimization can't be taken.

2022-04-08 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

--- Comment #11 from HaoChen Gui  ---
I tested the C source code with -Ofast, which allows store data races. Store
motion should be done, but it fails. The problem is in the cselim pass, which
does conditional store replacement. 'atemp' is converted to its alias-set-zero
form 'MEM  [(void *)]' in the if-else blocks and remains untouched outside the
if-else.

   atemp.0_5 = atemp;
  if (_4 < atemp.0_5)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 477815113]:
  cstore_18 = MEM  [(void *)];

   [local count: 955630225]:
  # cstore_17 = PHI 
  MEM  [(void *)] = cstore_17;

Then in lim2 pass, it thinks 'atemp' and 'MEM  [(void *)]' are
independent. So the store can't be moved out.

Memory reference 1: *_3
Memory reference 2: atemp
Memory reference 3: MEM  [(void *)]
Querying dependency of refs 3 and 1: independent.
Querying dependency of refs 3 and 2: dependent.
Querying SM WAR dependencies of ref 3 in loop 1: dependent

I wonder whether 'atemp' is equivalent to 'MEM  [(void *)]', and why we have to
use alias-set zero in the cselim pass.

  /* Make both store and load use alias-set zero as we have to
     deal with the case of the store being a conditional change
     of the dynamic type.  */

  while (handled_component_p (*basep))
    basep = &TREE_OPERAND (*basep, 0);
  if (TREE_CODE (*basep) == MEM_REF
      || TREE_CODE (*basep) == TARGET_MEM_REF)
    TREE_OPERAND (*basep, 1)
      = fold_convert (ptr_type_node, TREE_OPERAND (*basep, 1));
  else
    *basep = build2 (MEM_REF, TREE_TYPE (*basep),
                     build_fold_addr_expr (*basep),
                     build_zero_cst (ptr_type_node));

[Bug target/102146] [11 regression] several test cases fails after r11-8940

2022-04-07 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102146

--- Comment #5 from HaoChen Gui  ---
For prefix-no-update.c, the patch Segher proposed in PR103197 could fix it.

[Bug target/102146] [11 regression] several test cases fails after r11-8940

2022-04-07 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102146

--- Comment #4 from HaoChen Gui  ---
(In reply to Richard Biener from comment #2)
> What's the status on the remaining failures?

For pr56605.c, I have already submitted a patch and am waiting for review.
https://gcc.gnu.org/pipermail/gcc-patches/2022-February/590958.html

[Bug tree-optimization/105030] store motion if-change flag causes if-conversion optimization can't be taken.

2022-03-28 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

--- Comment #9 from HaoChen Gui  ---
The escaped flag for 'atemp' is not set with the Fortran source code, while it
is set with the C source code. 'auto_var_in_fn_p + pt_solution_includes' works
for the Fortran code. But if the function is called ahead of the loop in
Fortran, it is still unsafe for multithreading.

[Bug tree-optimization/105030] store motion if-change flag causes if-conversion optimization can't be taken.

2022-03-25 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

--- Comment #7 from HaoChen Gui  ---
The original case comes from a Fortran program; I rewrote it in C. As
arguments are passed by reference in Fortran by default, the problem is
common, but I am not sure whether it has a large performance impact.

[Bug tree-optimization/105030] store motion if-change flag causes if-conversion optimization can't be taken.

2022-03-25 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

--- Comment #5 from HaoChen Gui  ---
(In reply to Richard Biener from comment #4)
> something like
> 
> void *bar (void *x)
> {
>   *(double *)x = 1.;
> }
> 
> void foo(int n)
> {
>double atemp;
>pthread_create (..., bar, );
>for (int i = 0; i < n; i++)
>  if (a[i] < atemp)
>atemp = a[i];
>pthread_join (...);
>if (atemp != 1.)
>  abort ();
> }
> 
> if it is ensured the store to atemp in the loop never takes place then
> we have created a store data race when applying store motion.  Of course
> thread creation/join can be hidden in other functions called from foo()
> as long as 'atemp' escapes to callers.

I got it. Thanks a lot.
In the original case, bar is called after the loop. So can we still consider
atemp safe for multi-threading, as atemp can't be changed by other threads?
If bar is executed before the loop, it's unsafe. Could we check all blocks
dominated by the loop head to find references used in a call?

[Bug tree-optimization/105030] store motion if-change flag causes if-conversion optimization can't be taken.

2022-03-23 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

--- Comment #3 from HaoChen Gui  ---
(In reply to Richard Biener from comment #2)
> That occured to me as well - I think the answer is maybe.  In principle
> foo() could launch a thread and make the 'atemp' available to it.  As long
> as foo() outlives thread termination that should be all well-defined and so
> we'd have to take this possibility into account.
> 
> There's currently no code ruling this out and doing that is likely difficult.
> So maybe the answer is to add -fallow-store-data-races={local,all}, map
> -fallow-store-data-races to -fallow-store-data-races=all and make
> -fallow-store-data-races=local the default?

Local automatic variables (neither global nor static) should be stored on the
thread's own stack, so I think they can't be accessed by other threads; other
threads should create their own instances of them. So is 'atemp' thread safe
in the example code? Could you please explain why it's unsafe when foo
outlives thread termination? Thanks a lot.

diff --git a/gcc/tree-ssa-loop-im.cc b/gcc/tree-ssa-loop-im.cc
index 6d9316eed1f..e2b6b927351 100644
--- a/gcc/tree-ssa-loop-im.cc
+++ b/gcc/tree-ssa-loop-im.cc
@@ -2219,7 +2219,10 @@ execute_sm (class loop *loop, im_mem_ref *ref,
   bool always_stored = ref_always_accessed_p (loop, ref, true);
   if (maybe_mt
   && (bb_in_transaction (loop_preheader_edge (loop)->src)
- || (! flag_store_data_races && ! always_stored)))
+ || (! flag_store_data_races && ! always_stored
+ && (!auto_var_in_fn_p (ref->mem.ref, current_function_decl)
+ || TREE_THIS_VOLATILE (ref->mem.ref)
+ || TREE_STATIC (ref->mem.ref)))))
 multi_threaded_model_p = true;

Could we use the above conditions to exclude local auto variables from
multi-threaded safety considerations?

[Bug tree-optimization/105030] store motion if-change flag causes if-conversion optimization can't be taken.

2022-03-23 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

HaoChen Gui  changed:

   What|Removed |Added

   Host||powerpc-*-linux-gnu

--- Comment #1 from HaoChen Gui  ---
   bool always_stored = ref_always_accessed_p (loop, ref, true);
   if (maybe_mt
   && (bb_in_transaction (loop_preheader_edge (loop)->src)
   || (! flag_store_data_races && ! always_stored)))
 multi_threaded_model_p = true;

Could we consider multi_threaded_model_p to be false here for a local auto
variable?

[Bug tree-optimization/105030] New: store motion if-change flag causes if-conversion optimization can't be taken.

2022-03-22 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105030

Bug ID: 105030
   Summary: store motion if-change flag causes if-conversion
optimization can't be taken.
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

// source code
extern void bar (double *, int);

void foo (double a[], int n)
{
  double atemp = 0.5;
  for (int i = 0; i < n; i++)
    if (a[i] < atemp)
      atemp = a[i];
  bar (&atemp, n);
}

// -O3 -fdump-tree-lim2
  if (_4 < atemp.0_5)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 477815112]:
  atemp_lsm.4_24 = _4;
  atemp_lsm_flag.5_25 = 1;

The lsm flag is created in the lim2 pass, so the "then" block has two sets,
which blocks the if-conversion optimization.
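
Written by hand in C, the flagged form looks roughly like the sketch below.
This is only an approximation of what the dump shows, not compiler output; the
function names and the pointer parameter standing in for the escaped 'atemp'
are hypothetical.

```c
#include <stdbool.h>

/* Original form: the store to *atemp happens only when the condition
   holds.  */
static void
min_conditional_store (double *atemp, const double *a, int n)
{
  for (int i = 0; i < n; i++)
    if (a[i] < *atemp)
      *atemp = a[i];
}

/* Approximation of store motion with a flag: the loop works on a local
   copy and the final store is guarded, so no store is introduced on
   paths that never stored.  */
static void
min_flagged_store (double *atemp, const double *a, int n)
{
  double atemp_lsm = *atemp;
  bool atemp_lsm_flag = false;
  for (int i = 0; i < n; i++)
    if (a[i] < atemp_lsm)
      {
        atemp_lsm = a[i];       /* first set in the "then" block */
        atemp_lsm_flag = true;  /* second set in the "then" block */
      }
  if (atemp_lsm_flag)
    *atemp = atemp_lsm;
}
```

Both functions leave the same value in *atemp; the second one keeps the extra
flag assignment inside the conditional block.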

// assembly with -O3 -ffast-math -fno-unroll-loops on ppc64le
.L5:
lfd 0,0(3)
addi 3,3,8
fcmpu 0,12,0
ble 0,.L3
fmr 12,0
li 9,1
.L3:
bdnz .L5
andi. 9,9,0x1
beq 0,.L2
stfd 12,32(1)

Inefficient fcmpu is used. If the source code is tweaked as below, the
efficient xvmindp is generated.

// tweaked source code
extern void bar (double *, int);

void foo (double a[], int n)
{
  double atemp = 0.5;
  for (int i = 0; i < n; i++)
    if (a[i] < atemp)
      atemp = a[i];
  double btemp = atemp;
  bar (&btemp, n);
}

//assembly
.L4:
lxv 0,0(9)
addi 9,9,16
xvmindp 12,12,0
bdnz .L4

[Bug target/103316] PowerPC: Gimple folding of int128 comparisons produces suboptimal code

2022-03-06 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103316

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #15 from HaoChen Gui  ---
I drafted a patch to define separate expanders for V1TI, and it works. I wonder
whether I should instead add V1TI to VEC_I and define a unified expander for
V16QI, V8HI, V4SI, V2DI and V1TI; some insn patterns should then be merged as
well. Please advise.
Thanks a lot.

[Bug rtl-optimization/98179] gcc.dg/pr97954.c fails on (at least) BE powerpc

2022-02-21 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98179

HaoChen Gui  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID
 CC||guihaoc at gcc dot gnu.org

--- Comment #1 from HaoChen Gui  ---
Tested gcc11 on Power8 BE. Unable to reproduce this issue. The issue should be
already fixed by r11-5613-g404d0ca7820bbf258e2edfac423403ee31b48a7b.

[Bug target/103124] PPC: "mr" instruction is unnecessary when extending DI to V1TI

2022-01-17 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103124

HaoChen Gui  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #7 from HaoChen Gui  ---
fixed by r12-6620

[Bug target/100952] [12 regression] several test case failures after r12-1202

2022-01-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100952

--- Comment #16 from HaoChen Gui  ---
prefix-no-update.c should be fixed by the patch Segher proposed in PR103197.
pr56605.c got a wrong fix and fails with GCC 11. I will submit a patch to fix
it.

[Bug target/103197] ppc inline expansion of memcpy/memmove should not use lxsibzx/stxsibx for a single byte

2022-01-16 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103197

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #11 from HaoChen Gui  ---
Segher,
  Will you commit your patch in stage4? Several issues are supposed to be fixed
by your patch. Thanks.

[Bug target/95737] PPC: Unnecessary extsw after negative less than

2022-01-14 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95737

--- Comment #9 from HaoChen Gui  ---
Add a pattern to convert the plus mode to DI. 

+(define_insn_and_split "*my_split"
+  [(set (match_operand:DI 0 "gpc_reg_operand")
+   (sign_extend:DI (plus:SI (match_operand:SI 1 "ca_operand")
+(const_int -1))))]
+  ""
+  "#"
+  ""
+  [(parallel [(set (match_dup 0)
+  (plus:DI (match_dup 2)
+   (const_int -1)))
+ (clobber (match_dup 2))])]
+{
+  operands[2] = copy_rtx (operands[1]);
+  PUT_MODE (operands[2], DImode);
+})

With the patch, the "extsw" is optimized out. I compared the performance of the
P8 code (with the patch) and the P9 code; the P9 code performs better.
The ISA says that computation with CA causes additional latency, which appears
to be true. The only concern is that the P9 code uses more registers.

[Bug target/93127] PPC altivec vec_promote creates unnecessary xxpermdi instruction

2022-01-14 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93127

HaoChen Gui  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED
 CC||guihaoc at gcc dot gnu.org

--- Comment #4 from HaoChen Gui  ---
I committed a patch (r12-4987) which is related to this issue, but it doesn't
behave as the ticket expects. With the patch, vec_min/max is bound to
xv[min|max]dp when fast-math is not set. If fast-math is set, it can be folded
into a scalar comparison, so with fast-math set the code could be
xsmaxdp 1,1,2
blr

[Bug target/95737] PPC: Unnecessary extsw after negative less than

2022-01-05 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95737

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #6 from HaoChen Gui  ---
//source code
unsigned long long negativeLessThan(unsigned long long a, unsigned long long b)
{
   return -(a < b);
}

//P8 with -O2
subfc 4,4,3
subfe 3,3,3
extsw 3,3


//P9 with -O2
li 10,0
li 9,1
cmpld 0,3,4
isel 3,9,10,0
neg 3,3

Seems cmp+isel on P9 is sub-optimal.
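
For reference, the P8 subfc/subfe pair can be modeled in C as follows. This is
a sketch under my reading of the two instructions (CA is 1 when a - b does not
borrow, and subfe with identical source operands yields CA - 1); it relies on
the GCC/Clang builtin __builtin_sub_overflow, and the function names are
hypothetical.

```c
#include <stdbool.h>

/* Source-level form from the test case.  */
static unsigned long long
neg_lt_ref (unsigned long long a, unsigned long long b)
{
  return -(unsigned long long) (a < b);
}

/* Model of the carry-based sequence: compute a - b, derive CA from the
   borrow, then form CA - 1 (all ones when a < b, zero otherwise).  */
static unsigned long long
neg_lt_carry (unsigned long long a, unsigned long long b)
{
  unsigned long long diff;
  bool borrow = __builtin_sub_overflow (a, b, &diff);
  unsigned long long ca = borrow ? 0 : 1;
  return ca - 1;
}
```

Both functions return all ones when a < b and zero otherwise, without any
compare-and-select step.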

[Bug target/103784] suboptimal code for returning bool value on target ppc

2021-12-20 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

--- Comment #4 from HaoChen Gui  ---
output with "-fdump-tree-optimized=/dev/stdout"
;; Function foo (foo, funcdef_no=0, decl_uid=3317, cgraph_uid=1,
symbol_order=0)

Removing basic block 5
_Bool foo (int a, int b)
{
  _Bool _1;
  _Bool _5;

   [local count: 1073741824]:
  if (a_2(D) > 2)
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 708669601]:
  _5 = b_3(D) <= 9;

   [local count: 1073741824]:
  # _1 = PHI <_5(3), 0(2)>
  return _1;

}

[Bug target/103784] suboptimal code for returning bool value on target ppc

2021-12-20 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

--- Comment #2 from HaoChen Gui  ---
Sorry, I pasted the wrong code. Here is the correct one.

//test.c
#include <stdbool.h>

bool foo (int a, int b)
{
  if (a > 2)
    return false;
  if (b < 10)
    return true;
  return false;
}

//assembly with the trunk
cmpwi 0,3,2
bgt 0,.L3
cmpwi 0,4,9
li 3,1
isel 3,0,3,1
rldicl 3,3,0,63
blr
.p2align 4,,15
.L3:
li 3,0
rldicl 3,3,0,63
blr

Are the two zero extensions (rldicl) unnecessary?
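
As a source-level cross-check (an illustration only, with hypothetical
function names), the test case reduces to a single predicate whose value is
already 0 or 1:

```c
#include <stdbool.h>

/* Branching form from test.c.  */
static bool
foo_branchy (int a, int b)
{
  if (a > 2)
    return false;
  if (b < 10)
    return true;
  return false;
}

/* Equivalent single predicate; each comparison yields 0 or 1, so the
   combined result needs no further masking.  */
static bool
foo_predicate (int a, int b)
{
  return (a <= 2) & (b < 10);
}
```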

[Bug target/103784] New: suboptimal code for returning bool value on target ppc

2021-12-20 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103784

Bug ID: 103784
   Summary: suboptimal code for returning bool value on target ppc
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test.c

#include <stdbool.h>

bool foo (int a, int b)
{
  if (a > 2)
return false;
  if (b < 10)
return true;
  return true;
}

//assembly with trunk
ld 9,0(3)
cmpdi 0,9,0
add 10,9,4
beq 0,.L5
ldarx 8,0,3
cmpd 0,8,9
bne 0,.L4
stdcx. 10,0,3
bne 0,.L4
li 3,1
rldicl 3,3,0,63
blr
.p2align 4,,15
.L5:
li 3,0
rldicl 3,3,0,63
blr

//assembly with at13.0
subfic 3,3,2
srdi 3,3,63
xori 3,3,0x1
blr

The second branch and the two zero extensions are unnecessary. If it returns an
integer, the code seems good.

//test1.c
int foo (int a, int b)
{
  if (a > 2)
return 0;
  if (b < 10)
return 1;
  return 1;
}

//assembly with trunk
li 9,1
cmpwi 0,3,2
isel 3,0,9,1
blr

[Bug target/100952] [12 regression] several test case failures after r12-1202

2021-12-19 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100952

--- Comment #13 from HaoChen Gui  ---
Issue for fusion-p10-ldcmpi.c was fixed by r12-1655. 
https://gcc.gnu.org/pipermail/gcc-cvs/2021-June/349357.html

[Bug target/100736] ICE: unrecognizable insn

2021-12-08 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100736

--- Comment #4 from HaoChen Gui  ---
Yes, there is a question. With my patch, the test case generates the following
assembly. The two sequences seem to have the same latency (cror vs. crnot). I
wonder why we need to reverse the CR bit comparison when finite-math-only is
set. Thanks.

without finite-math-only
bcdsub. %v2,%v2,%v3,0
cror 26,25,26
mfcr %r3,2
rlwinm %r3,%r3,27,1

with finite-math-only
bcdsub. %v2,%v2,%v3,0
crnot 26,24
mfcr %r3,2
rlwinm %r3,%r3,27,1

[Bug target/100736] ICE: unrecognizable insn

2021-12-07 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100736

HaoChen Gui  changed:

   What|Removed |Added

 CC||guihaoc at gcc dot gnu.org

--- Comment #2 from HaoChen Gui  ---
The root cause of the issue is that the condition rtx can't be recognized when
finite-math-only is set.
I drafted a patch to modify the expander of "bcd__". It expands as before when
finite-math-only is not set, and expands with a reversed comparison
(le -> ungt, ge -> unlt) when finite-math-only is set.
"rs6000_reverse_compare" is a helper function whose code is extracted from
"rs6000_emit_sCOND".
Any comments? Thanks a lot.
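
For readers less familiar with the unordered comparison codes, the pairing
le/ungt and ge/unlt can be illustrated on ordinary doubles. This is only a
floating-point analogy with hypothetical helper names; it says nothing about
how the BCD instructions set CR6.

```c
#include <math.h>
#include <stdbool.h>

/* UNGT: unordered or greater than.  */
static bool
ungt (double a, double b)
{
  return isunordered (a, b) || a > b;
}

/* UNLT: unordered or less than.  */
static bool
unlt (double a, double b)
{
  return isunordered (a, b) || a < b;
}
```

With these helpers, !(a <= b) equals ungt (a, b) and !(a >= b) equals
unlt (a, b) for every pair of inputs, including those with a NaN operand.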

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index 93d237156d5..e91a1af6805 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -4415,7 +4415,7 @@ (define_insn "bcd_"
 ;; UNORDERED test on an integer type (like V1TImode) is not defined.  The type
 ;; probably should be one that can go in the VMX (Altivec) registers, so we
 ;; can't use DDmode or DFmode.
-(define_insn "*bcd_test_"
+(define_insn "bcd_test_"
   [(set (reg:CCFP CR6_REGNO)
(compare:CCFP
 (unspec:V2DF [(match_operand:VBCD 1 "register_operand" "v")
@@ -4542,6 +4542,18 @@ (define_expand "bcd__"
   "TARGET_P8_VECTOR"
 {
   operands[4] = CONST0_RTX (V2DFmode);
+  emit_insn (gen_bcd_test_ (operands[0], operands[1],
+  operands[2], operands[3],
+  operands[4]));
+  rtx cr6 = gen_rtx_REG (CCFPmode, CR6_REGNO);
+  rtx condition_rtx = gen_rtx_ (SImode, cr6, const0_rtx);
+  if (flag_finite_math_only)
+{
+  condition_rtx = rs6000_reverse_compare (condition_rtx);
+  PUT_MODE (condition_rtx, SImode);
+}
+  emit_insn (gen_rtx_SET (operands[0], condition_rtx));
+  DONE;
 })

 (define_insn "*bcdinvalid_"
diff --git a/gcc/config/rs6000/rs6000-protos.h
b/gcc/config/rs6000/rs6000-protos.h
index 14f6b313105..9b93e26bec2 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -114,6 +114,7 @@ extern enum rtx_code rs6000_reverse_condition
(machine_mode,
 extern rtx rs6000_emit_eqne (machine_mode, rtx, rtx, rtx);
 extern rtx rs6000_emit_fp_cror (rtx_code, machine_mode, rtx);
 extern void rs6000_emit_sCOND (machine_mode, rtx[]);
+extern rtx rs6000_reverse_compare (rtx);
 extern void rs6000_emit_cbranch (machine_mode, rtx[]);
 extern char * output_cbranch (rtx, const char *, int, rtx_insn *);
 extern const char * output_probe_stack_range (rtx, rtx, rtx);
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index ad860728169..39a36add08f 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -15690,19 +15690,14 @@ rs6000_emit_fp_cror (rtx_code code, machine_mode
mode, rtx x)
   return cc;
 }

-void
-rs6000_emit_sCOND (machine_mode mode, rtx operands[])
+rtx
+rs6000_reverse_compare (rtx condition_rtx)
 {
-  rtx condition_rtx = rs6000_generate_compare (operands[1], mode);
   rtx_code cond_code = GET_CODE (condition_rtx);
-
-  if (FLOAT_MODE_P (mode) && HONOR_NANS (mode)
-  && !(FLOAT128_VECTOR_P (mode) && !TARGET_FLOAT128_HW))
-;
-  else if (cond_code == NE
-  || cond_code == GE || cond_code == LE
-  || cond_code == GEU || cond_code == LEU
-  || cond_code == ORDERED || cond_code == UNGE || cond_code == UNLE)
+  if (cond_code == NE
+  || cond_code == GE || cond_code == LE
+  || cond_code == GEU || cond_code == LEU
+  || cond_code == ORDERED || cond_code == UNGE || cond_code == UNLE)
 {
   rtx not_result = gen_reg_rtx (CCEQmode);
   rtx not_op, rev_cond_rtx;
@@ -15716,6 +15711,19 @@ rs6000_emit_sCOND (machine_mode mode, rtx operands[])
   emit_insn (gen_rtx_SET (not_result, not_op));
   condition_rtx = gen_rtx_EQ (VOIDmode, not_result, const0_rtx);
 }
+  return condition_rtx;
+}
+
+void
+rs6000_emit_sCOND (machine_mode mode, rtx operands[])
+{
+  rtx condition_rtx = rs6000_generate_compare (operands[1], mode);
+
+  if (FLOAT_MODE_P (mode) && HONOR_NANS (mode)
+  && !(FLOAT128_VECTOR_P (mode) && !TARGET_FLOAT128_HW))
+  ;
+  else
+condition_rtx = rs6000_reverse_compare (condition_rtx);

   machine_mode op_mode = GET_MODE (XEXP (operands[1], 0));
   if (op_mode == VOIDmode)
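
The "le -> ungt, ge -> unlt" reversal mentioned above follows the usual rule for reversing a comparison that may be unordered. A minimal standalone sketch of that mapping is below; the enum and the function name are hypothetical and only mimic the rtx codes, they are not GCC's own definitions.

```c
/* Hypothetical stand-ins for the rtx comparison codes.  */
enum cmp_code
{
  CMP_EQ, CMP_NE,
  CMP_LT, CMP_LE, CMP_GT, CMP_GE,
  CMP_UNLT, CMP_UNLE, CMP_UNGT, CMP_UNGE,
  CMP_ORDERED, CMP_UNORDERED
};

/* Reverse a comparison so that the result is the exact logical negation,
   including the case where the operands compare unordered.  */
static enum cmp_code
reverse_maybe_unordered (enum cmp_code code)
{
  switch (code)
    {
    case CMP_EQ: return CMP_NE;
    case CMP_NE: return CMP_EQ;
    case CMP_LT: return CMP_UNGE;
    case CMP_LE: return CMP_UNGT;
    case CMP_GT: return CMP_UNLE;
    case CMP_GE: return CMP_UNLT;
    case CMP_UNLT: return CMP_GE;
    case CMP_UNLE: return CMP_GT;
    case CMP_UNGT: return CMP_LE;
    case CMP_UNGE: return CMP_LT;
    case CMP_ORDERED: return CMP_UNORDERED;
    case CMP_UNORDERED: return CMP_ORDERED;
    }
  return code;
}
```

Applying the function twice gives back the original code, which is a quick sanity check on the table.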

[Bug target/100868] PPC: Inefficient code for vec_reve(vector double)

2021-12-06 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100868

HaoChen Gui  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
                 CC|                            |guihaoc at gcc dot gnu.org
             Status|NEW                         |RESOLVED

--- Comment #3 from HaoChen Gui  ---
Fixed on trunk.

[Bug target/103124] PPC: "mr" instruction is unnecessary when extending DI to V1TI

2021-12-01 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103124

--- Comment #5 from HaoChen Gui  ---
(In reply to Segher Boessenkool from comment #4)
>   Skipping mode TI for zero_extend lowering.
>   Splitting mode TI for ashift lowering with shift amounts = 
>   Splitting mode TI for lshiftrt lowering with shift amounts = 
>   Splitting mode TI for ashiftrt lowering with shift amounts = 
> 
> All these should be fixed.  But it is a red herring here, since the shift we
> have is in DImode already anyway, not TImode.
> 
> Confirmed btw.

I am not sure whether splitting TI for zero_extend or shift is necessary. Do we
generate a zero_extend:TI rtx or an ashift:TI rtx? During expand, these
operations are already expanded to operations on DImode. They will be split
later if TI copy splitting is allowed.

zero_extend
b = (unsigned __int128) a;

(insn 7 6 8 2 (set (subreg:DI (reg:TI 119) 8)
(const_int 0 [0])) "split1.c":5:10 -1
 (nil))


ashift
b = a << 3;

(insn 6 3 7 2 (set (reg:DI 122)
(lshiftrt:DI (subreg:DI (reg/v:TI 119 [ a ]) 0)
(const_int 61 [0x3d]))) "split.c":4:9 -1
 (nil))
(insn 7 6 8 2 (set (subreg:DI (reg:TI 121) 8)
(ashift:DI (subreg:DI (reg/v:TI 119 [ a ]) 8)
(const_int 3 [0x3]))) "split.c":4:9 -1
 (nil))
(insn 8 7 9 2 (set (subreg:DI (reg:TI 121) 8)
(ior:DI (reg:DI 122)
(subreg:DI (reg:TI 121) 8))) "split.c":4:9 -1
 (nil))
(insn 9 8 10 2 (set (subreg:DI (reg:TI 121) 0)
(ashift:DI (subreg:DI (reg/v:TI 119 [ a ]) 0)
(const_int 3 [0x3]))) "split.c":4:9 -1
 (nil))
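
For reference, the four DImode insns above correspond to the following C, written as a sketch with a hypothetical two-halves struct (byte offset 0 is taken to be the low doubleword, as the dump suggests):

```c
#include <stdint.h>

/* Hypothetical 128-bit value held as two 64-bit halves.  */
struct u128
{
  uint64_t lo; /* subreg at byte offset 0 */
  uint64_t hi; /* subreg at byte offset 8 */
};

/* a << 3 done purely with 64-bit operations.  */
static struct u128
shl3 (struct u128 a)
{
  struct u128 r;
  uint64_t carry = a.lo >> 61; /* insn 6: bits moving from low to high half */
  r.hi = a.hi << 3;            /* insn 7 */
  r.hi |= carry;               /* insn 8 */
  r.lo = a.lo << 3;            /* insn 9 */
  return r;
}
```

No TImode shift is needed at any point, which is the observation made above.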

[Bug target/93453] PPC: rldimi not taken into account to avoid shift+or

2021-11-24 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93453

--- Comment #8 from HaoChen Gui  ---
I refined the patch and put all things in a helper - change_pseudo_and_mask. As
you mentioned, it's still a band-aid. The perfect solution might be a better
version of nonzero_bits. Thanks.

diff --git a/gcc/combine.c b/gcc/combine.c
index 892c834a160..f0e6ca5d8cf 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -11539,6 +11539,41 @@ change_zero_ext (rtx pat)
   return changed;
 }

+/* Convert a psuedo to psuedo AND with a mask if its nonzero_bits is less
+   than its mode mask.  */
+static bool
+change_pseudo_and_mask (rtx pat)
+{
+  bool changed = false;
+
+  rtx src = SET_SRC (pat);
+  if ((GET_CODE (src) == IOR
+   || GET_CODE (src) == XOR
+   || GET_CODE (src) == PLUS)
+  && (((GET_CODE (XEXP (src, 0)) == ASHIFT
+   || GET_CODE (XEXP (src, 0)) == LSHIFTRT
+   || GET_CODE (XEXP (src, 0)) == AND)
+  && REG_P (XEXP (src, 1)))
+ || ((GET_CODE (XEXP (src, 1)) == ASHIFT
+  || GET_CODE (XEXP (src, 1)) == LSHIFTRT
+  || GET_CODE (XEXP (src, 1)) == AND)
+ && REG_P (XEXP (src, 0)))))
+{
+  rtx *reg = REG_P (XEXP (src, 0))
+? &XEXP (SET_SRC (pat), 0)
+: &XEXP (SET_SRC (pat), 1);
+  machine_mode mode = GET_MODE (*reg);
+  unsigned HOST_WIDE_INT nonzero = nonzero_bits (*reg, mode);
+  if (nonzero < GET_MODE_MASK (mode))
+   {
+ rtx x = gen_rtx_AND (mode, *reg, GEN_INT (nonzero));
+ SUBST (*reg, x);
+ changed = true;
+   }
+ }
+  return changed;
+}
+
 /* Like recog, but we receive the address of a pointer to a new pattern.
We try to match the rtx that the pointer points to.
If that fails, we may try to modify or replace the pattern,
@@ -11565,9 +11600,18 @@ recog_for_combine (rtx *pnewpat, rtx_insn *insn, rtx
*pnotes)

   void *marker = get_undo_marker ();
   bool changed = false;
+  //bool PIX_opt = false;

   if (GET_CODE (pat) == SET)
-changed = change_zero_ext (pat);
+{
+  changed = change_pseudo_and_mask (pat);
+  if (changed)
+   {
+ maybe_swap_commutative_operands (SET_SRC (pat));
+ //PIX_opt = true;
+   }
+  changed |= change_zero_ext (pat);
+}
   else if (GET_CODE (pat) == PARALLEL)
 {
   int i;
@@ -11585,6 +11629,8 @@ recog_for_combine (rtx *pnewpat, rtx_insn *insn, rtx
*pnotes)

   if (insn_code_number < 0)
undo_to_marker (marker);
+  //else if (PIX_opt)
+   //fprintf (stdout, "PIX applied\n");
 }

   return insn_code_number;

[Bug target/93453] PPC: rldimi not taken into account to avoid shift+or

2021-11-22 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93453

--- Comment #6 from HaoChen Gui  ---
Segher,
  Yes, I found that nonzero_bits doesn't return an exact value in other passes.
So calling nonzero_bits in the md file is bad, as the insn can't be recognized
in other passes.
  Right now I want to convert a single pseudo to the pseudo ANDed with a mask in
the combine pass if its nonzero_bits is less than its mode mask, the outer
operation is plus/ior/xor, and one of its inner operations is
ashift/lshiftrt/and. Thus it is possible to match the rotate and insert pattern.
What's your opinion? Thanks a lot.

(ior:DI (ashift:DI (reg:DI 125)
(const_int 32 [0x20]))
(reg:DI 126)))

is converted to

(ior:DI (ashift:DI (reg:DI 125)
   (const_int 32 [0x20]))
(and:DI (reg:DI 126)
(const_int 4294967295 [0xffffffff])))


patch.diff

diff --git a/gcc/combine.c b/gcc/combine.c
index 892c834a160..8b72a5ec831 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -11539,6 +11539,26 @@ change_zero_ext (rtx pat)
   return changed;
 }

+/* Convert a psuedo to psuedo AND with a mask if its nonzero_bits is less
+   than its mode mask.  */
+static bool
+pseudo_and_with_mask (rtx *reg)
+{
+  bool changed = false;
+  gcc_assert (REG_P (*reg));
+
+  machine_mode mode = GET_MODE (*reg);
+  unsigned HOST_WIDE_INT nonzero = nonzero_bits (*reg, mode);
+  if (nonzero < GET_MODE_MASK (mode))
+{
+  rtx x = gen_rtx_AND (mode, *reg, GEN_INT (nonzero));
+  SUBST (*reg, x);
+  changed = true;
+  //fprintf (stdout, "PIX optimization\n");
+}
+  return changed;
+}
+
 /* Like recog, but we receive the address of a pointer to a new pattern.
We try to match the rtx that the pointer points to.
If that fails, we may try to modify or replace the pattern,
@@ -11565,9 +11585,34 @@ recog_for_combine (rtx *pnewpat, rtx_insn *insn, rtx
*pnotes)

   void *marker = get_undo_marker ();
   bool changed = false;
+  //bool PIX_opt = false;

   if (GET_CODE (pat) == SET)
-changed = change_zero_ext (pat);
+{
+  rtx src = SET_SRC (pat);
+  if ((GET_CODE (src) == IOR
+  || GET_CODE (src) == XOR
+  || GET_CODE (src) == PLUS)
+ && (((GET_CODE (XEXP (src, 0)) == ASHIFT
+   || GET_CODE (XEXP (src, 0)) == LSHIFTRT
+   || GET_CODE (XEXP (src, 0)) == AND)
+  && REG_P (XEXP (src, 1)))
+ || ((GET_CODE (XEXP (src, 1)) == ASHIFT
+  || GET_CODE (XEXP (src, 1)) == LSHIFTRT
+  || GET_CODE (XEXP (src, 1)) == AND)
+ && REG_P (XEXP (src, 0)))))
+   {
+ changed = REG_P (XEXP (src, 0))
+   ? pseudo_and_with_mask (&XEXP (SET_SRC (pat), 0))
+   : pseudo_and_with_mask (&XEXP (SET_SRC (pat), 1));
+ if (changed)
+   {
+ maybe_swap_commutative_operands (SET_SRC (pat));
+ //PIX_opt = true;
+   }
+   }
+  changed |= change_zero_ext (pat);
+}
   else if (GET_CODE (pat) == PARALLEL)
 {
   int i;
@@ -11585,6 +11630,8 @@ recog_for_combine (rtx *pnewpat, rtx_insn *insn, rtx
*pnotes)

   if (insn_code_number < 0)
undo_to_marker (marker);
+  //else if (PIX_opt)
+   //fprintf (stdout, "PIX applied\n");
 }

   return insn_code_number;

[Bug target/93453] PPC: rldimi not taken into account to avoid shift+or

2021-11-15 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93453

--- Comment #4 from HaoChen Gui  ---
For the second issue, I drafted the following insn_and_split pattern. It tries
to combine the shift and ior when the nonzero_bits of operands[3] matches the
condition.

(define_insn_and_split "*rotl<mode>3_insert_8"
  [(set (match_operand:GPR 0 "gpc_reg_operand" "=r")
        (plus_ior_xor:GPR
          (ashift:GPR (match_operand:GPR 1 "gpc_reg_operand" "r")
                      (match_operand:SI 2 "const_int_operand" "n"))
          (match_operand:GPR 3 "gpc_reg_operand" "0")))]
  "INTVAL (operands[2]) > 0
   && (nonzero_bits (operands[3], <MODE>mode)
       < HOST_WIDE_INT_1U << INTVAL (operands[2]))"
{
  if (<MODE>mode == SImode)
    return "rlwimi %0,%1,%h2,0,31-%h2";
  else
    return "rldimi %0,%1,%H2,0";
}
  "&& 1"
  [(set (match_dup 0)
        (ior:GPR (and:GPR (match_dup 3)
                          (match_dup 4))
                 (ashift:GPR (match_dup 1)
                             (match_dup 2))))]
{
  operands[4] = GEN_INT ((HOST_WIDE_INT_1U << INTVAL (operands[2])) - 1);
}
  [(set_attr "type" "insert")])

But I found that nonzero_bits can't return an exact value except in the combine
pass. So the pattern finally can't be split to the pattern
'rotl<mode>3_insert_3'. Also, if a pass after combine changes the insn, it
can't be recognized, as nonzero_bits doesn't return an exact value in that
pass.

I am wondering whether we can convert the third operand to "reg AND a mask"
when the nonzero_bits is known in the combine pass. Then the pattern can be
directly combined to 'rotl<mode>3_insert_3'.

(set (reg:DI 123)
(ior:DI (ashift:DI (reg:DI 125)
(const_int 32 [0x20]))
(reg:DI 126)))

(set (reg:DI 123)
(ior:DI (ashift:DI (reg:DI 125)
   (const_int 32 [0x20]))
(and:DI (reg:DI 126)
(const_int 4294967295 [0xffffffff]))))
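
To illustrate why the nonzero_bits condition matters, here is a small C model of the 64-bit case. It is a sketch only: rldimi_model is a hypothetical helper that mimics "rldimi RA,RS,SH,0" (rotate RS left by SH and insert it into RA above the low SH bits), and it assumes 0 < sh < 64.

```c
#include <stdint.h>

/* Rotate left by n bits, with n taken modulo 64.  */
static uint64_t
rotl64 (uint64_t x, unsigned n)
{
  n &= 63;
  return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Hypothetical model of "rldimi ra,rs,sh,0": the low sh bits of ra are
   kept, everything above is replaced by the rotated rs.  */
static uint64_t
rldimi_model (uint64_t ra, uint64_t rs, unsigned sh)
{
  uint64_t mask = ~UINT64_C (0) << sh;
  return (rotl64 (rs, sh) & mask) | (ra & ~mask);
}

/* The shift + or form that appears in the RTL above.  */
static uint64_t
shift_or (uint64_t ra, uint64_t rs, unsigned sh)
{
  return (rs << sh) | ra;
}
```

The two functions agree when ra has no bits set at or above position sh, which is exactly what the explicit AND with the mask makes visible in the RTL. If ra can have higher bits set, the results may differ, so the insert form would not be a valid replacement.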

[Bug target/103124] PPC: "mr" instruction is unnecessary when extending DI to V1TI

2021-11-08 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103124

--- Comment #3 from HaoChen Gui  ---
My solution is to split the move (from TI to V1TI) into one vsx_concat_v2di and
one V2DI to V1TI move. Thus, TI register 122 can be decomposed.

(insn 12 11 17 2 (set (reg:V1TI 121 [ b ])
(subreg:V1TI (reg:TI 122 [ a ]) 0)) "test2.c":4:5 1167
{vsx_movv1ti_64bit}
 (expr_list:REG_DEAD (reg:TI 122 [ a ])
(nil)))

//after split pass
(insn 23 11 24 2 (set (reg:V2DI 125)
(vec_concat:V2DI (subreg:DI (reg:TI 122 [ a ]) 0)
(subreg:DI (reg:TI 122 [ a ]) 8))) "test2.c":4:5 -1
 (nil))
(insn 24 23 17 2 (set (reg:V1TI 121 [ b ])
(subreg:V1TI (reg:V2DI 125) 0)) "test2.c":4:5 -1
 (nil))

//after subreg pass
(insn 23 11 24 2 (set (reg:V2DI 125)
(vec_concat:V2DI (reg:DI 126 [ a ])
(reg:DI 127 [ a+8 ]))) "test2.c":4:5 1346 {vsx_concat_v2di}
 (nil))
(insn 24 23 17 2 (set (reg:V1TI 121 [ b ])
(subreg:V1TI (reg:V2DI 125) 0)) "test2.c":4:5 1167 {vsx_movv1ti_64bit}
 (nil))

[Bug target/93453] PPC: rldimi not taken into account to avoid shift+or

2021-11-08 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93453

HaoChen Gui  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guihaoc at gcc dot gnu.org

--- Comment #2 from HaoChen Gui  ---
My solution is to split the move (from TI to V1TI) into one vsx_concat_v2di and
one V2DI to V1TI move. Thus, TI register 122 can be decomposed.

(insn 12 11 17 2 (set (reg:V1TI 121 [ b ])
(subreg:V1TI (reg:TI 122 [ a ]) 0)) "test2.c":4:5 1167
{vsx_movv1ti_64bit}
 (expr_list:REG_DEAD (reg:TI 122 [ a ])
(nil)))

//after split pass
(insn 23 11 24 2 (set (reg:V2DI 125)
(vec_concat:V2DI (subreg:DI (reg:TI 122 [ a ]) 0)
(subreg:DI (reg:TI 122 [ a ]) 8))) "test2.c":4:5 -1
 (nil))
(insn 24 23 17 2 (set (reg:V1TI 121 [ b ])
(subreg:V1TI (reg:V2DI 125) 0)) "test2.c":4:5 -1
 (nil))

//after subreg pass
(insn 23 11 24 2 (set (reg:V2DI 125)
(vec_concat:V2DI (reg:DI 126 [ a ])
(reg:DI 127 [ a+8 ]))) "test2.c":4:5 1346 {vsx_concat_v2di}
 (nil))
(insn 24 23 17 2 (set (reg:V1TI 121 [ b ])
(subreg:V1TI (reg:V2DI 125) 0)) "test2.c":4:5 1167 {vsx_movv1ti_64bit}
 (nil))

[Bug target/103124] PPC: "mr" instruction is unnecessary when extending DI to V1TI

2021-11-08 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103124

--- Comment #2 from HaoChen Gui  ---
//lower-subreg.c
  /* If this is a cast from one mode to another, where the modes
 have the same size, and they are not tieable, then mark this
 register as non-decomposable.  If we decompose it we are
 likely to mess up whatever the backend is trying to do.  */
  if (outer_words > 1
  && outer_size == inner_size
  && !targetm.modes_tieable_p (GET_MODE (x), GET_MODE (inner)))
{
  bitmap_set_bit (non_decomposable_context, regno);
  bitmap_set_bit (subreg_context, regno);
  iter.skip_subrtxes ();
  continue;
}

As TI and V1TI are not tieable on powerpc, TI register 122 in the following
insn can't be decomposed.

(insn 12 11 13 2 (set (reg:V1TI 121 [ b ])
(subreg:V1TI (reg:TI 122) 0)) "test2.c":4:5 1167 {vsx_movv1ti_64bit}
 (nil))
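
The test quoted from lower-subreg.c boils down to a three-part predicate. A standalone sketch is below; the function and parameter names are hypothetical, and the tieable flag stands in for the result of targetm.modes_tieable_p.

```c
#include <stdbool.h>

/* Returns true when a same-size mode cast should mark the inner register
   as non-decomposable.  Sizes are in bytes; outer_words is the number of
   machine words the outer mode occupies.  */
static bool
cast_blocks_decomposition (unsigned outer_words, unsigned outer_size,
                           unsigned inner_size, bool tieable)
{
  return outer_words > 1 && outer_size == inner_size && !tieable;
}
```

For the insn above (V1TI subreg of a TI register on a 64-bit target) the arguments would be 2 words, 16 and 16 bytes, and not tieable, so the register is marked non-decomposable.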

[Bug target/103124] PPC: "mr" instruction is unnecessary when extending DI to V1TI

2021-11-07 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103124

--- Comment #1 from HaoChen Gui  ---
Build command gcc -O2 -S test.c -mcpu=power9

[Bug target/103124] New: PPC: "mr" instruction is unnecessary when extending DI to V1TI

2021-11-07 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103124

Bug ID: 103124
   Summary: PPC: "mr" instruction is unnecessary when extending DI
to V1TI
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: guihaoc at gcc dot gnu.org
  Target Milestone: ---

//test.c
vector __int128 init (long long a)
{
  vector __int128 b;
  b = (vector __int128) {a};
  return b;
}

gcc -O2 -s test.c -mcpu=power9

//p9 assembly
mr 10,3
sradi 11,3,63
mtvsrdd 34,11,10

The first mr is unnecessary if the last one is changed to "mtvsrdd 34,11,3".
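
What the three instructions compute can be written out in portable C. This is a sketch with a hypothetical struct for the 128-bit result; it assumes mtvsrdd places its first source register in the high doubleword and the second in the low doubleword.

```c
#include <stdint.h>

/* Hypothetical view of the vector __int128 result as two doublewords.  */
struct v1ti
{
  uint64_t hi; /* from "sradi 11,3,63": all copies of the sign bit */
  uint64_t lo; /* the original 64-bit value */
};

/* Sign-extend a 64-bit value to 128 bits.  */
static struct v1ti
extend_di (int64_t a)
{
  struct v1ti r;
  r.lo = (uint64_t) a;
  r.hi = a < 0 ? UINT64_MAX : 0;
  return r;
}
```

The low half is just the incoming argument unchanged, so no register copy is needed to form it.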

[Bug target/102169] powerpc64 int memory operations using FP instructions

2021-09-29 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102169

HaoChen Gui  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guihaoc at gcc dot gnu.org

--- Comment #4 from HaoChen Gui  ---
In this case, it picks "GEN_OR_VSX_REGS" because FLOAT_REGS costs zero in the
ira pass. There is a "d,Z" alternative pair in the "*movsi_internal1" pattern.
When the second operand is not an "indexed_or_indirect_operand", a reload is
needed. In this case, the reload is needed because it's a d-form address, which
doesn't match 'Z'. So we should penalize the reload of 'Z' by changing the
alternative to '^Z'.

[Bug target/102146] [11 regression] several test cases fails after r11-8940

2021-09-02 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102146

--- Comment #1 from HaoChen Gui  ---
For pr81348.c, it was already fixed by r11-8941. Segher backported it. 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100952#c12

PASS: gcc.target/powerpc/pr81348.c (test for excess errors)
PASS: gcc.target/powerpc/pr81348.c scan-assembler \\mlha\\M
PASS: gcc.target/powerpc/pr81348.c scan-assembler \\mmtvsrwa\\M

[Bug target/101865] _ARCH_PWR8 is not defined when using -mcpu=power8

2021-08-25 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101865

--- Comment #9 from HaoChen Gui  ---
(In reply to Tulio Magno Quites Machado Filho from comment #7)
> (In reply to HaoChen Gui from comment #6)
> > Does _ARCH_PWR8 impact anything during the compiling?
> 
> I can answer this question from an user point of view. It's used in many
> projects to indicate if the target processor supports certain features, e.g.
> 
> #ifdef _ARCH_PWR8
> asm (...); /* Power8-specific code. */
> #else
> /* Generic implementation. */
> ...
> #endif

For this example, let's suppose that we set -mcpu=power8 and -mno-vsx on the
command line. Then _ARCH_PWR8 should be defined, because of -mcpu=power8. But if
the Power8-specific code contains VSX instructions, could the asm be executed?
I think we just use the macro _ARCH_PWR8 here to guard which instructions are
available.
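
One way user code could cover the -mno-vsx case is to test a feature macro in addition to _ARCH_PWR8. The sketch below uses __VSX__ for that; the function name and the returned strings are made up for illustration, and whether this matches what real projects do is an assumption.

```c
/* Report which code path the preprocessor selected.  */
static const char *
selected_impl (void)
{
#if defined (_ARCH_PWR8) && defined (__VSX__)
  return "power8 with vsx";    /* Power8-specific code using VSX */
#elif defined (_ARCH_PWR8)
  return "power8 without vsx"; /* Power8-specific code, no VSX */
#else
  return "generic";            /* Generic implementation */
#endif
}
```

With -mcpu=power8 -mno-vsx the middle branch would be chosen if _ARCH_PWR8 is defined as Segher describes, so VSX asm is never reached.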

[Bug target/101865] _ARCH_PWR8 is not defined when using -mcpu=power8

2021-08-24 Thread guihaoc at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101865

--- Comment #6 from HaoChen Gui  ---
(In reply to Segher Boessenkool from comment #5)
> (In reply to HaoChen Gui from comment #4)
> > I wonder if it's a Power8 architecture when those 6 options are all
> > disabled. Or it is regressed to Power7? The "_ARCH_PWR8" represents the
> > hardware architecture or the ISA it can be taken?
> 
> It says what version of the ISA is effective.  It should correspond to the
> -mcpu= used.
> 
> Even if it would be possible to use -mcpu=power8 and then other command line
> flags that set everything back to it being just like Power7 (architecturally
> that is, ignoring scheduling etc.), we should still define _ARCH_PWR8.

Thanks for your comments. If it's architecturally Power7, why do we still need
to define _ARCH_PWR8? Does _ARCH_PWR8 impact anything during compilation? Could
we just use -mcpu instead? It seems _ARCH_PWR8 only affects instruction
selection right now.
