[Bug ipa/109509] Huge compile time with forced inlining

2023-04-14 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109509

--- Comment #1 from Chip Kerchner  ---
For the record: the same code, which makes heavy use of always_inline, compiles
about 3X faster with LLVM and needs about half the memory to compile.

[Bug tree-optimization/109491] [11/12 Regression] Segfault in tree-ssa-sccvn.cc:expressions_equal_p()

2023-04-14 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109491

--- Comment #14 from Chip Kerchner  ---
Just one more question and then I'll switch to the new bug.

Would it help at all if the functions that are "always_inline" were changed from
non-static to static?

Eigen's approach (where this code originally came from - yes, it could
definitely be better) is to use non-static inline functions.

[Bug tree-optimization/109491] [11/12 Regression] Segfault in tree-ssa-sccvn.cc:expressions_equal_p()

2023-04-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109491

--- Comment #12 from Chip Kerchner  ---
> having always_inline across a deep call stack can exponentially increase 
> compile-time

Do you think it would be worth requesting a feature to reduce compilation
times in situations like this?  Exponential growth is not ideal.
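
To make the quoted point concrete, here is a minimal sketch (not taken from Eigen or from the bug's testcase) of how forced inlining can multiply the amount of code the compiler has to process. The macro `ALWAYS_INLINE` and the functions `leaf`, `level1` to `level3`, `inlined_leaf_copies` and `check_inline_chain` are hypothetical names chosen for illustration. Each level calls the one below it twice, so a chain of depth 3 expands to 2^3 = 8 copies of `leaf` before any later pass can simplify them.

```cpp
#include <cassert>

// Hypothetical macro in the style of EIGEN_ALWAYS_INLINE (GCC/Clang attribute syntax).
#define ALWAYS_INLINE __attribute__((always_inline)) inline

ALWAYS_INLINE int leaf(int x) { return x + 1; }

// Each level calls the previous level twice, so forced inlining doubles
// the number of inlined leaf bodies at every level.
ALWAYS_INLINE int level1(int x) { return leaf(leaf(x)); }
ALWAYS_INLINE int level2(int x) { return level1(level1(x)); }
ALWAYS_INLINE int level3(int x) { return level2(level2(x)); }

// Number of leaf bodies produced by inlining a chain of the given depth.
constexpr unsigned long inlined_leaf_copies(unsigned depth) {
  return 1UL << depth;
}

// Usage: level3 applies leaf eight times, matching the 2^3 copies.
inline void check_inline_chain() {
  assert(level3(0) == 8);
  assert(inlined_leaf_copies(3) == 8UL);
}
```

With a fan-out of two per level, every extra level doubles the inlined body, which is the kind of growth the quoted remark describes.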

[Bug target/109501] vec_test_data_class defines missing

2023-04-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109501

--- Comment #8 from Chip Kerchner  ---
Well, then I'm asking GCC to add these to make it easier to use
`vec_test_data_class`

[Bug target/109501] vec_test_data_class defines missing

2023-04-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109501

--- Comment #5 from Chip Kerchner  ---
Here's a testcase

```
#include 
#include 

int main()
{
  __vector float p4f = { float(0), float(1), float(2), float(3) };
  __vector __bool int nan_selector = vec_test_data_class(p4f, __VEC_CLASS_FP_NAN);

  return 0;
}
```

```
NAN_defines.cpp: In function ‘int main()’:
NAN_defines.cpp:7:63: error: ‘__VEC_CLASS_FP_NAN’ was not declared in this scope
    7 |   __vector __bool int nan_selector = vec_test_data_class(p4f, __VEC_CLASS_FP_NAN);
      |                                                               ^~~~~~~~~~~~~~~~~~
```

```
/opt/gcc-nightly/trunk/bin/g++ -O3 -mcpu=power9 NAN_defines.cpp
```

[Bug target/109501] vec_test_data_class defines missing

2023-04-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109501

--- Comment #4 from Chip Kerchner  ---
PowerPC LE - P9.

Yes, other PVIPR APIs are available and compile in other source code.

[Bug c++/109501] vec_test_data_class defines missing

2023-04-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109501

--- Comment #2 from Chip Kerchner  ---
'__VEC_CLASS_FP_NAN' was not declared in this scope

[Bug c++/109501] vec_test_data_class defines missing

2023-04-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109501

Chip Kerchner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |chip.kerchner at ibm dot com

--- Comment #1 from Chip Kerchner  ---
```
  __vector float p4f = some data;

 1645 |   __vector __bool int nan_selector = vec_test_data_class(p4f, __VEC_CLASS_FP_NAN);
```

[Bug c++/109501] New: vec_test_data_class defines missing

2023-04-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109501

            Bug ID: 109501
           Summary: vec_test_data_class defines missing
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chip.kerchner at ibm dot com
  Target Milestone: ---

These defines seem to be missing:

#define __VEC_CLASS_FP_NAN (1<<6)
#define __VEC_CLASS_FP_INFINITY_P (1<<5)
#define __VEC_CLASS_FP_INFINITY_N (1<<4)
#define __VEC_CLASS_FP_ZERO_P (1<<3)
#define __VEC_CLASS_FP_ZERO_N (1<<2)
#define __VEC_CLASS_FP_SUBNORMAL_P (1<<1)
#define __VEC_CLASS_FP_SUBNORMAL_N (1<<0)
#define __VEC_CLASS_FP_INFINITY (__VEC_CLASS_FP_INFINITY_P | __VEC_CLASS_FP_INFINITY_N)
#define __VEC_CLASS_FP_ZERO (__VEC_CLASS_FP_ZERO_P | __VEC_CLASS_FP_ZERO_N)
#define __VEC_CLASS_FP_SUBNORMAL (__VEC_CLASS_FP_SUBNORMAL_P | __VEC_CLASS_FP_SUBNORMAL_N)
#define __VEC_CLASS_FP_NOT_NORMAL (__VEC_CLASS_FP_NAN | __VEC_CLASS_FP_SUBNORMAL \
                                   | __VEC_CLASS_FP_ZERO | __VEC_CLASS_FP_INFINITY)
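
Until a header provides these masks, one possible stopgap is to define them locally behind a guard, so that a compiler-provided definition takes precedence once it exists. The sketch below assumes the bit values listed in this report are the intended ones (they are copied from above, not checked against the ISA documentation). The helper `check_class_masks` is a hypothetical name. Identifiers starting with a double underscore are reserved for the implementation, so a project may prefer its own prefix.

```cpp
#include <cassert>

// Only define the masks when the toolchain has not already done so.
#ifndef __VEC_CLASS_FP_NAN
#define __VEC_CLASS_FP_NAN         (1 << 6)
#define __VEC_CLASS_FP_INFINITY_P  (1 << 5)
#define __VEC_CLASS_FP_INFINITY_N  (1 << 4)
#define __VEC_CLASS_FP_ZERO_P      (1 << 3)
#define __VEC_CLASS_FP_ZERO_N      (1 << 2)
#define __VEC_CLASS_FP_SUBNORMAL_P (1 << 1)
#define __VEC_CLASS_FP_SUBNORMAL_N (1 << 0)
#define __VEC_CLASS_FP_INFINITY \
  (__VEC_CLASS_FP_INFINITY_P | __VEC_CLASS_FP_INFINITY_N)
#define __VEC_CLASS_FP_ZERO \
  (__VEC_CLASS_FP_ZERO_P | __VEC_CLASS_FP_ZERO_N)
#define __VEC_CLASS_FP_SUBNORMAL \
  (__VEC_CLASS_FP_SUBNORMAL_P | __VEC_CLASS_FP_SUBNORMAL_N)
#define __VEC_CLASS_FP_NOT_NORMAL \
  (__VEC_CLASS_FP_NAN | __VEC_CLASS_FP_SUBNORMAL | \
   __VEC_CLASS_FP_ZERO | __VEC_CLASS_FP_INFINITY)
#endif

// Sanity check of the combined masks: the seven class bits together give 0x7f.
inline void check_class_masks() {
  assert(__VEC_CLASS_FP_INFINITY == 0x30);
  assert(__VEC_CLASS_FP_ZERO == 0x0c);
  assert(__VEC_CLASS_FP_SUBNORMAL == 0x03);
  assert(__VEC_CLASS_FP_NOT_NORMAL == 0x7f);
}
```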

[Bug target/70243] PowerPC V4SFmode should not use Altivec instructions on VSX systems

2023-04-05 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70243

--- Comment #4 from Chip Kerchner  ---
It shows up as a rounding difference on BE machines.

[Bug target/70243] PowerPC V4SFmode should not use Altivec instructions on VSX systems

2023-04-05 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70243

Chip Kerchner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |chip.kerchner at ibm dot com

--- Comment #3 from Chip Kerchner  ---
This is showing up in some of the binaries generated by Eigen (with GCC13).

[Bug target/109116] vector_pair register allocation bug

2023-03-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109116

--- Comment #2 from Chip Kerchner  ---
This could be a bigger issue with register allocation after the disassembly of
an opaque object like vector_pair or MMA.

[Bug rtl-optimization/109116] vector_pair register allocation bug

2023-03-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109116

--- Comment #1 from Chip Kerchner  ---
This has been in GCC since the initial version that supported __vector_pair
(10.x)

[Bug rtl-optimization/109116] New: vector_pair register allocation bug

2023-03-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109116

            Bug ID: 109116
           Summary: vector_pair register allocation bug
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chip.kerchner at ibm dot com
  Target Milestone: ---

There seems to be a bug in the register allocator when using a __vector_pair.
For the load, GCC does not choose a register that suits the later instructions.

With this testcase

```
#include 

#if !__has_builtin(__builtin_vsx_disassemble_pair)
#define __builtin_vsx_disassemble_pair __builtin_mma_disassemble_pair
#endif

int main() {
  float A[8] = { float(1), float(2), float(3), float(4),
                 float(5), float(6), float(7), float(8) };
  __vector_pair P;
  __vector_quad Q;
  vector float B, C[2], D[4];

  __builtin_mma_xxsetaccz(&Q);
  P = *reinterpret_cast<__vector_pair *>(A);
  B = *reinterpret_cast<vector float *>(A);
  __builtin_vsx_disassemble_pair((void*)(C), &P);
  __builtin_mma_xvf32gerpp(&Q, reinterpret_cast<__vector unsigned char>(C[0]),
                           reinterpret_cast<__vector unsigned char>(B));
  __builtin_mma_xvf32gerpp(&Q, reinterpret_cast<__vector unsigned char>(C[1]),
                           reinterpret_cast<__vector unsigned char>(B));
  __builtin_mma_disassemble_acc((void *)D, &Q);

  return int(D[0][0]);
}
```

It produces an output with extra (unneeded) register moves.

```
plxvp 12,.LANCHOR0@pcrel
xxsetaccz 0
plxv 33,.LC1@pcrel
xxlor 45,13,13
xxlor 32,12,12
xvf32gerpp 0,45,33
xvf32gerpp 0,32,33
xxmfacc 0
```

[Bug rtl-optimization/108757] We do not simplify (a - (N*M)) / N + M -> a / N

2023-02-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108757

--- Comment #16 from Chip Kerchner  ---
Dang copy and paste issue...  This is what I meant.

unsigned long int
foo (unsigned long int a)
{
  return (a + (N*M)) / N - M;
}

[Bug rtl-optimization/108757] We do not simplify (a - (N*M)) / N + M -> a / N

2023-02-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108757

--- Comment #15 from Chip Kerchner  ---
How about this (from Peter's testcase)?  Does it still have issues?  It
produces the same assembly.

#define N 32
#define M 2

unsigned long int
foo (unsigned long int a)
{
  return (a - (N*M)) / N + M;
}
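
As a rough check of when this rewrite is safe for unsigned operands, the sketch below compares the original expression with `a / N`. It assumes a 64-bit `unsigned long`, modelled here with `std::uint64_t`. The names `simplified` and `check_unsigned_rewrite` are hypothetical. The two forms agree whenever `a >= N*M`, but for smaller `a` the subtraction wraps modulo 2^64 and the results differ, so the rewrite would need range information about `a`.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t N = 32;
constexpr std::uint64_t M = 2;

// The expression from the testcase, using a fixed 64-bit unsigned type.
constexpr std::uint64_t foo(std::uint64_t a) { return (a - (N * M)) / N + M; }

// The form named in the bug title.
constexpr std::uint64_t simplified(std::uint64_t a) { return a / N; }

inline void check_unsigned_rewrite() {
  // a >= N*M: the subtraction does not wrap, both forms agree.
  assert(foo(64) == simplified(64));    // both are 2
  assert(foo(100) == simplified(100));  // both are 3
  // a < N*M: a - 64 wraps around, so the two results differ.
  assert(foo(0) != simplified(0));
  assert(foo(63) != simplified(63));
}
```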

[Bug tree-optimization/108757] We do not simplify (a - (N*M)) / N + M -> a / N

2023-02-13 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108757

--- Comment #12 from Chip Kerchner  ---
Here is an example of the original problem

#define EIGEN_ALWAYS_INLINE __attribute__((always_inline)) inline

typedef __vector float Packet4f;
typedef size_t Index;

EIGEN_ALWAYS_INLINE Packet4f ploadu(const float* from)
{
  return vec_xl(0, const_cast<float*>(from));
}

EIGEN_ALWAYS_INLINE void pstoreu(float* to, const Packet4f& from)
{
  vec_xst(from, 0, to);
}

void convert(Index rows, float* src, float* result)
{
  for (Index i = 0; i + 4 <= rows; i += 4) {
    Packet4f r32_0 = ploadu(src + i + 0);
    pstoreu(result + i + 0, r32_0);
  }
}

And the output (with annotations on the lines in question)

cmpldi 0,3,3
blelr 0
addi 3,3,-4  <- i = rows - 4
li 9,0
srdi 3,3,2   <- i >>= 2
addi 8,3,1   <- i = i + 1
andi. 7,8,0x3
mr 10,8
beq 0,.L10
cmpdi 0,7,1
beq 0,.L14
cmpdi 0,7,2
beq 0,.L15
lxv 0,0(4)
mr 8,3
li 9,16
stxv 0,0(5)
.L15:
lxvx 0,4,9
addi 8,8,-1
stxvx 0,5,9
addi 9,9,16
.L14:
lxvx 0,4,9
cmpdi 0,8,1
stxvx 0,5,9
addi 9,9,16
beqlr 0
.L10:
srdi 10,10,2
mtctr 10
.L3:
lxvx 0,4,9
addi 10,9,16
addi 7,9,32
addi 8,9,48
stxvx 0,5,9
lxvx 0,4,10
addi 9,9,64
stxvx 0,5,10
lxvx 0,4,7
stxvx 0,5,7
lxvx 0,4,8
stxvx 0,5,8
bdnz .L3
blr

In this case the three annotated lines can be replaced by a single `srdi 8,3,2`.
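
The equivalence behind that claim can be sketched in a few lines. The function names below are hypothetical, and the check relies on the guard already present in the generated code (`cmpldi 0,3,3` followed by `blelr`), which returns early unless `rows` is greater than 3. Under that precondition `rows - 4` cannot wrap, and the subtract, shift, add sequence gives the same value as a single shift.

```cpp
#include <cassert>
#include <cstddef>

// The three annotated instructions: subtract 4, shift right by 2, add 1.
constexpr std::size_t trip_count_as_emitted(std::size_t rows) {
  return ((rows - 4) >> 2) + 1;
}

// The single-shift form (srdi 8,3,2).
constexpr std::size_t trip_count_single_shift(std::size_t rows) {
  return rows >> 2;
}

inline void check_trip_counts() {
  // Only rows >= 4 reach this computation, so the subtraction is safe.
  for (std::size_t rows = 4; rows < 4096; ++rows) {
    assert(trip_count_as_emitted(rows) == trip_count_single_shift(rows));
  }
}
```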

[Bug tree-optimization/108757] We do not simplify (a - (N*M)) / N + M -> a / N

2023-02-11 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108757

--- Comment #11 from Chip Kerchner  ---
Never mind; using an example similar to the one Segher gave, it would fail too.

[Bug tree-optimization/108757] We do not simplify (a - (N*M)) / N + M -> a / N

2023-02-11 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108757

--- Comment #10 from Chip Kerchner  ---
Oops that should be 31 * -2, not 33.

[Bug tree-optimization/108757] We do not simplify (a - (N*M)) / N + M -> a / N

2023-02-11 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108757

--- Comment #9 from Chip Kerchner  ---
Doesn't this work for powers of two (N) and signed values (for A, N and M)?

(59 - (33 * -2)) / -2 + 31 = -62 + 31 = -29

and

59 / -2 = -29
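
The arithmetic can be checked directly with C++ signed division, which truncates toward zero. The sketch below uses the factor from the follow-up correction (31 * -2) and adds one hypothetical counterexample of my own choosing (it is not the example from the thread) showing that the signed case can still fail when the subtraction moves the dividend across zero. The function names are hypothetical as well.

```cpp
#include <cassert>

// (a - n*m) / n + m, with truncating signed division.
constexpr long original_form(long a, long n, long m) {
  return (a - n * m) / n + m;
}

constexpr long simplified_form(long a, long n) { return a / n; }

inline void check_signed_examples() {
  // a = 59, n = -2, m = 31: (59 + 62) / -2 = -60, and -60 + 31 = -29.
  assert(original_form(59, -2, 31) == -29);
  assert(simplified_form(59, -2) == -29);
  // Hypothetical counterexample with a power-of-two divisor:
  // (1 - 2) / 2 + 1 = 0 + 1 = 1, but 1 / 2 = 0.
  assert(original_form(1, 2, 1) == 1);
  assert(simplified_form(1, 2) == 0);
}
```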

[Bug ipa/102059] Incorrect always_inline diagnostic in LTO mode with #pragma GCC target("cpu=power10")

2021-09-15 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102059

--- Comment #24 from Chip Kerchner  ---
(In reply to Kewen Lin from comment #23)
> Hi Chip, I can reproduce this error with trunk. With some investigation, I
> think it's not duplicated of this PR, some information restoring seems wrong
> when lto. Could you please file a separated PR?  Thanks in advance!

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102347

[Bug lto/102347] New: "fatal error: target specific builtin not available" with MMA and LTO

2021-09-15 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102347

            Bug ID: 102347
           Summary: "fatal error: target specific builtin not available" with MMA and LTO
           Product: gcc
           Version: 10.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: lto
          Assignee: unassigned at gcc dot gnu.org
          Reporter: chip.kerchner at ibm dot com
                CC: marxin at gcc dot gnu.org
  Target Milestone: ---

I'm seeing MMA problems with LTO.  With this simple program (main.ii)

--
#pragma GCC target "cpu=power10"
int main() {
  float *b;
  __vector_quad c;
  __builtin_mma_disassemble_acc(b, &c);
  return 0;
}
--

And this compile

--
g++ -flto=auto -mcpu=power9 main.ii
--

I'm seeing this error (which does NOT occur without LTO)

--
lto1: error: '__builtin_mma_xxmfacc_internal' requires the '-mmma' option
lto1: fatal error: target specific builtin not available
compilation terminated.
--

[Bug ipa/102059] Incorrect always_inline diagnostic in LTO mode with #pragma GCC target("cpu=power10")

2021-09-14 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102059

--- Comment #22 from Chip Kerchner  ---
(In reply to Chip Kerchner from comment #21) - Forgot one line of code
> --
> #pragma GCC target "cpu=power10"
> int main() {
>   float *b;
>   __vector_quad c;
>   __builtin_mma_disassemble_acc(b, &c);
>   return 0;
> }
> --

[Bug ipa/102059] Incorrect always_inline diagnostic in LTO mode with #pragma GCC target("cpu=power10")

2021-09-14 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102059

--- Comment #21 from Chip Kerchner  ---
I'm also seeing MMA problems with LTO.  With this simple program (main.ii)

--
int main() {
  float *b;
  __vector_quad c;
  __builtin_mma_disassemble_acc(b, &c);
  return 0;
}
--

And this compile (using gcc 10.3.1)

--
g++ -flto=auto -mcpu=power9 main.ii
--

I'm seeing this error (which does NOT occur without LTO)

--
lto1: error: '__builtin_mma_xxmfacc_internal' requires the '-mmma' option
lto1: fatal error: target specific builtin not available
compilation terminated.
--

[Bug ipa/102059] Incorrect always_inline diagnostic in LTO mode with #pragma GCC target("cpu=power10")

2021-08-25 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102059

Chip Kerchner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |chip.kerchner at ibm dot com

--- Comment #3 from Chip Kerchner  ---
It also fails with version 10.3

[Bug c++/92031] [9 Regression] Incorrect "taking address of r-value" error

2021-04-19 Thread chip.kerchner at ibm dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92031

Chip Kerchner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |chip.kerchner at ibm dot com

--- Comment #5 from Chip Kerchner  ---
Please backport this fix into 9.4 or 9.5.