[Bug target/100799] Stackoverflow in optimized code on PPC

2024-03-22 Thread aagarwa at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #34 from Ajit Kumar Agarwal  ---
Sent the patch for review.

Here is the patch:
[PATCH] rs6000: Stackoverflow in optimized code on PPC (PR100799)

When using FlexiBLAS with OpenBLAS we noticed corruption of
the parameters passed to OpenBLAS functions. FlexiBLAS
basically provides a BLAS interface where each function
is a stub that forwards the arguments to a real BLAS lib,
like OpenBLAS.

Fix the corruption of the caller's frame by checking whether the
number of arguments is less than or equal to GP_ARG_NUM_REG (8),
excluding hidden unused DECLs.

2024-03-22  Ajit Kumar Agarwal  

gcc/ChangeLog:

PR rtl-optimization/100799
* config/rs6000/rs6000-call.cc (rs6000_function_arg): Don't
generate parameter save area if number of arguments passed
is less than or equal to GP_ARG_NUM_REG (8) excluding hidden
parameter.
* function.cc (assign_parms_initialize_all): Check for hidden
parameter in Fortran code and set the flag hidden_string_length
and actual parameter passed excluding hidden unused DECLS.
* function.h: Add new field hidden_string_length and
actual_parm_length in function structure.
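
The rule in the ChangeLog can be sketched in plain C as follows. This is only an illustration, not the patch itself: `struct function_model` and `assume_parm_save_area` are hypothetical names, the meaning given to the two fields is an assumption based on the ChangeLog wording, and every parameter is assumed to occupy exactly one general-purpose register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of GPRs used for argument passing (r3..r10), per the patch text.  */
#define GP_ARG_NUM_REG 8

/* Hypothetical stand-in for the two fields the ChangeLog adds to
   struct function; the semantics are assumed for illustration.  */
struct function_model
{
  bool hidden_string_length; /* trailing hidden lengths exist and are unused */
  int actual_parm_length;    /* parameter count without those hidden lengths */
};

/* Return true if the callee may assume the caller allocated a
   parameter save area.  TOTAL_PARMS counts every PARM_DECL.  */
static bool
assume_parm_save_area (const struct function_model *f, int total_parms)
{
  assert (f != NULL);
  int counted = f->hidden_string_length ? f->actual_parm_length : total_parms;
  return counted > GP_ARG_NUM_REG;
}
```

With eight visible parameters plus one unused hidden length (the DGEBAL case), this model reports that no parameter save area may be assumed, which matches what a C caller passing eight arguments actually provides.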

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-03-01 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Peter Bergner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aagarwa at gcc dot gnu.org

--- Comment #32 from Peter Bergner  ---
(In reply to Peter Bergner from comment #31)
> Ok, I think that gives us some idea what needs to be done.  I'll look for
> someone in the team to have a look at implementing this workaround.  Thanks.

Ajit has agreed to try and implement the workaround.

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-27 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #31 from Peter Bergner  ---
(In reply to Jakub Jelinek from comment #30)
> Either tree parmdef = ssa_default_def (cfun, parm) is NULL, or has_zero_uses
> (parmdef).
> Not sure if has_zero_uses will work properly after some bbs are converted
> from GIMPLE to RTL, but maybe it will, I think the expansion generally
> doesn't gsi_remove statements it expands nor calls update_stmt on them.  One
> could always also just compute in generic code at the start of expansion the
> number of unused DECL_HIDDEN_STRING_LENGTH PARM_DECLs at the end of the
> argument list, save that as a flag in struct function or where and let the
> backends use it from there.

Ok, I think that gives us some idea what needs to be done.  I'll look for
someone in the team to have a look at implementing this workaround.  Thanks.

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-26 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #30 from Jakub Jelinek  ---
Either tree parmdef = ssa_default_def (cfun, parm) is NULL, or has_zero_uses
(parmdef).
Not sure if has_zero_uses will work properly after some bbs are converted from
GIMPLE to RTL, but maybe it will, I think the expansion generally doesn't
gsi_remove statements it expands nor calls update_stmt on them.  One could
always also just compute in generic code at the start of expansion the number
of unused DECL_HIDDEN_STRING_LENGTH PARM_DECLs at the end of the argument list,
save that as a flag in struct function or where and let the backends use it
from there.
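
A minimal sketch of that counting step, written as stand-alone C rather than against GCC's real tree and SSA APIs. The structs `ssa_def_model` and `parm_model` and both functions are hypothetical models; only the test itself (no default def, or a default def with zero uses) is taken from the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an SSA default definition; only the use count
   matters here.  */
struct ssa_def_model
{
  int num_uses;
};

/* Hypothetical model of a PARM_DECL.  */
struct parm_model
{
  bool hidden_string_length;               /* models DECL_HIDDEN_STRING_LENGTH */
  const struct ssa_def_model *default_def; /* NULL if no default def exists */
};

/* A parameter is unused if it has no default def or the def has no uses.  */
static bool
parm_is_unused (const struct parm_model *p)
{
  return p->default_def == NULL || p->default_def->num_uses == 0;
}

/* Count unused hidden string lengths at the end of the parameter list.  */
static size_t
count_trailing_unused_hidden (const struct parm_model *parms, size_t n)
{
  assert (parms != NULL || n == 0);
  size_t count = 0;
  while (n > 0 && parms[n - 1].hidden_string_length
         && parm_is_unused (&parms[n - 1]))
    {
      count++;
      n--;
    }
  return count;
}
```

The walk stops at the first parameter from the end that is either a normal parameter or a used hidden length, so only a trailing run of unused hidden lengths is discounted.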

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-26 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #29 from Peter Bergner  ---
(In reply to Jakub Jelinek from comment #28)
> Yes, so it is the backend that told function.cc that there is a parameter
> save area and it should be adding REG_EQUIV notes.  So, the idea would be
> that for the case we talk about (<= 8 normal arguments, then only unused
> DECL_HIDDEN_STRING_LENGTH ones) that the backend would also say that there
> is no parameter save area, basically pretend there are <= 8 arguments.

How can we know there are no uses of the hidden arg(s)?  That backend function
is being called at expand time, so we haven't yet run any RTL dataflow
analysis to tell us.  Is there some tree attribute for the arg that can tell
us whether it's used or not?  ...or is there some SSA data for that arg that
can show it has no use?  ...and if so, would that still work for -O0 compiles?

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-26 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #28 from Jakub Jelinek  ---
(In reply to Peter Bergner from comment #27)
> So I looked closer at what the failure mode was in this PR (versus the one
> you're seeing with flexiblas).  As in your case, there is a mismatch in the
> number of parameters the C caller thinks there are (8 args, so no param save
> area needed) versus what the Fortran callee thinks there are (9 params which
> include the one hidden arg, so there is a param save area).  The Fortran
> function doesn't actually access the hidden argument in our test case above,
> in fact the character argument is never used either.  What I see in the rtl
> dumps is that *all* incoming args have a REG_EQUIV generated that points to
> the param save area (this doesn't happen when there are 8 or fewer formal
> params), even for the first 8 args that are passed in registers:

Yes, so it is the backend that told function.cc that there is a parameter save
area and it should be adding REG_EQUIV notes.  So, the idea would be that for
the case we talk about (<= 8 normal arguments, then only unused
DECL_HIDDEN_STRING_LENGTH ones) that the backend would also say that there is
no parameter save area, basically pretend there are <= 8 arguments.

> > Doing the workaround on the caller side is impossible, this is for calls
> > from C/C++ to Fortran code, directly or indirectly called and there is
> > nothing the compiler could use to guess that it actually calls Fortran code
> > with hidden Fortran character arguments.
> As a HUGE hammer, every caller could always allocate a param save area. 
> That would "fix" the problem from this bug, but would that also fix the bug
> you're seeing in flexiblas?

Most likely yes.  Though of course that is way too high a price to pay, even
with some non-default option.  If we can't work around it in the backend just
on the callee side of calls which have the unused hidden string length
arguments, then better no changes on the GCC side.

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-24 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #27 from Peter Bergner  ---
(In reply to Jakub Jelinek from comment #26)
> But I still think the workaround is possible on the callee side.
> Sure, if the DECL_HIDDEN_STRING_LENGTH argument(s) is(are) used in the
> function, then there is no easy way but expect the parameter save area (ok,
> sure, it could just load from the assumed parameter location and don't
> assume the rest is there, nor allow storing to the slots it loaded them
> from).
> But that is actually not what BLAS etc. suffers from.
[snip]
> So, the workaround could be for the case of unused DECL_HIDDEN_STRING_LENGTH
> arguments at the end of PARM_DECLs don't try to load those at all and don't
> assume there is parameter save area unless the non-DECL_HIDDEN_STRING_LENGTH
> or used DECL_HIDDEN_STRING_LENGTH arguments actually require it.

So I looked closer at what the failure mode was in this PR (versus the one
you're seeing with flexiblas).  As in your case, there is a mismatch in the
number of parameters the C caller thinks there are (8 args, so no param save
area needed) versus what the Fortran callee thinks there are (9 params which
include the one hidden arg, so there is a param save area).  The Fortran
function doesn't actually access the hidden argument in our test case above, in
fact the character argument is never used either.  What I see in the rtl dumps
is that *all* incoming args have a REG_EQUIV generated that points to the param
save area (this doesn't happen when there are 8 or fewer formal params), even
for the first 8 args that are passed in registers:

(insn 2 12 3 2 (set (reg/v/f:DI 117 [ r3 ])
(reg:DI 3 3 [ r3 ])) "callee-3.c":6:1 685 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 32 [0x20])) [1 r3+0 S8 A64])
(nil)))
(insn 3 2 4 2 (set (reg/v:DI 118 [ r4 ])
(reg:DI 4 4 [ r4 ])) "callee-3.c":6:1 685 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 40 [0x28])) [2 r4+0 S8 A64])
(nil)))
...

We then get to RA and we end up spilling one of the pseudos associated with one
of the other parameters (not the character param JOB).  LRA then uses that
REG_EQUIV note and, rather than allocating a new stack slot to spill to, it
uses the parameter save memory location for that parameter as the spill slot.
When we store to that memory location and the C caller has not allocated the
param save area, we end up clobbering an important part of the C caller's
stack, causing a crash.
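
To make the clobbering concrete, here is a small C model of a caller frame. The layout is an assumption chosen for illustration (a 112-byte frame with the saved r23..r31 in its top 72 bytes and no parameter save area, which happens to match the wrapper disassembly in comment #18); `clobbered_saved_gpr` is a hypothetical helper, not anything in GCC.

```c
#include <assert.h>

/* Assumed caller layout, for illustration only: a 112-byte frame whose
   top 72 bytes hold the saved r23..r31 and which has no parameter save
   area.  */
#define CALLER_FRAME_BYTES 112
#define FIRST_SAVED_GPR 23
#define LAST_SAVED_GPR 31

/* The callee's arg pointer equals the caller's stack pointer, so a store
   to ap+AP_OFFSET lands at that offset in the caller's frame.  Return the
   callee-saved GPR whose save slot is overwritten, or -1 if none is.  */
static int
clobbered_saved_gpr (int ap_offset)
{
  assert (ap_offset >= 0 && ap_offset % 8 == 0);
  int first_slot
    = CALLER_FRAME_BYTES - 8 * (LAST_SAVED_GPR - FIRST_SAVED_GPR + 1);
  if (ap_offset < first_slot || ap_offset >= CALLER_FRAME_BYTES)
    return -1;
  return FIRST_SAVED_GPR + (ap_offset - first_slot) / 8;
}
```

Under these assumptions the REG_EQUIV location of r4 (ap+40) aliases the slot that holds the caller's saved r23, so a spill there changes the value the caller later restores into r23.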

If we were to try and do a callee workaround, we would need to disable setting
those REG_EQUIV notes for the parameters... if that's even possible.  Since
Fortran uses call-by-reference parameter passing, isn't the updated param value
from the callee returned in the parameter save area itself???


> Doing the workaround on the caller side is impossible, this is for calls
> from C/C++ to Fortran code, directly or indirectly called and there is
> nothing the compiler could use to guess that it actually calls Fortran code
> with hidden Fortran character arguments.

As a HUGE hammer, every caller could always allocate a param save area.  That
would "fix" the problem from this bug, but would that also fix the bug you're
seeing in flexiblas?

I'm not advocating this though.  I was thinking maybe making callers (under an
option?) conservatively assume the callee is a Fortran function and for those C
arguments that could map to a Fortran parameter with a hidden argument, bump
the number of counted args by 1.  For example, a C function with 2 char/char *
args and 6 int args would think there are 8 normal args and 2 hidden args, so
it needs to allocate a param save area.  Is that not feasible?  ...or does that
not even address the issue you're seeing in your bug?
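
The conservative counting idea could look roughly like this in C. It is only a sketch of the proposal as worded in this comment, not something GCC does; the enum and both function names are hypothetical, and each argument is assumed to take one register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of GPRs used for argument passing (r3..r10).  */
#define GP_ARG_NUM_REG 8

/* Coarse classification of a C argument; the names are hypothetical.  */
enum c_arg_kind
{
  ARG_CHAR_OR_CHAR_PTR, /* could map to a Fortran CHARACTER dummy */
  ARG_OTHER
};

/* Count the arguments, adding one hidden length for each argument that
   might be a Fortran CHARACTER.  */
static int
conservative_arg_count (const enum c_arg_kind *args, int n)
{
  assert (n >= 0 && (args != NULL || n == 0));
  int count = n;
  for (int i = 0; i < n; i++)
    if (args[i] == ARG_CHAR_OR_CHAR_PTR)
      count++;
  return count;
}

/* The caller allocates a save area if the conservative count does not
   fit in the argument registers.  */
static bool
caller_allocates_parm_save_area (const enum c_arg_kind *args, int n)
{
  return conservative_arg_count (args, n) > GP_ARG_NUM_REG;
}
```

For the example in the text (2 char/char * args and 6 int args) the conservative count is 10, so the caller would allocate a parameter save area even though it passes only 8 arguments.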

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-22 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #26 from Jakub Jelinek  ---
(In reply to Peter Bergner from comment #25)
> CCing Mike and David for possible comments about the possible workarounds
> mentioned in Comment 23 and Comment 24.

Doing the workaround on the caller side is impossible, this is for calls from
C/C++ to Fortran code, directly or indirectly called and there is nothing the
compiler could use to guess that it actually calls Fortran code with hidden
Fortran character arguments.
But I still think the workaround is possible on the callee side.
Sure, if the DECL_HIDDEN_STRING_LENGTH argument(s) is(are) used in the
function, then there is no easy way but expect the parameter save area (ok,
sure, it could just load from the assumed parameter location and don't assume
the rest is there, nor allow storing to the slots it loaded them from).
But that is actually not what BLAS etc. suffers from.
If you have something like
subroutine foo (a, b, c, d, e, f, g, h)
  character a
  integer b, c, d, e, f, g, h
  call bar (a, b, c, d, e, f, g, h)
end subroutine foo
then the DECL_HIDDEN_STRING_LENGTH argument isn't used at all, on the callee
side the user said that one should treat it as if the length of a is 1, so
whatever the caller passes is unimportant and when passing to further calls it
will just use 1:
void foo (character(kind=1)[1:1] & restrict a, integer(kind=4) & restrict b,
integer(kind=4) & restrict c, integer(kind=4) & restrict d, integer(kind=4) &
restrict e, integer(kind=4) & restrict f, integer(kind=4) & restrict g,
integer(kind=4) & restrict h, integer(kind=8) _a)
{
   :
  bar (a_2(D), b_3(D), c_4(D), d_5(D), e_6(D), f_7(D), g_8(D), h_9(D), 1);
  return;

}
It would seem that the _a argument is useless, but as explained in PR90329 that
is because in Fortran you can call foo ("foo", 1, 2, 3, 4, 5, 6, 7) without
interfaces etc., and the first argument could be character, character(len=1),
character(len=3) or character(len=*) etc.  Only in the last case is the
argument actually needed; in the other cases it is ignored.

So, the workaround could be for the case of unused DECL_HIDDEN_STRING_LENGTH
arguments at the end of PARM_DECLs don't try to load those at all and don't
assume there is parameter save area unless the non-DECL_HIDDEN_STRING_LENGTH or
used DECL_HIDDEN_STRING_LENGTH arguments actually require it.

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-22 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Peter Bergner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dje at gcc dot gnu.org,
                   |                            |meissner at gcc dot gnu.org

--- Comment #25 from Peter Bergner  ---
CCing Mike and David for possible comments about the possible workarounds
mentioned in Comment 23 and Comment 24.

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-21 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #24 from Peter Bergner  ---
(In reply to Jakub Jelinek from comment #23)
> if the PowerPC backend maintainers wanted, there could be a similar workaround
> on the rs6000 backend side, in the decisions whether the callee can use
> the parameter save area or not ignore counting DECL_HIDDEN_STRING_LENGTH
> PARM_DECLs, so if e.g. 9 arguments are passed but one of them is
> DECL_HIDDEN_STRING_LENGTH, assume parameter save area is not there.

If the callee has 9 arguments, even if one is a hidden str len arg, then there
MUST be a parameter save area, since that is where the callee is supposed to
load the 9th argument from.  There is simply no other location that 9th
argument exists at.

I think the only viable rs6000 workaround is for the caller to allocate a
parameter save area in some cases where it doesn't think it needs one.  Ie, the
caller is calling a function which it thinks has 8 parameters and there might
be a hidden one (maybe one param is a string or whatever the Fortran CHARACTER
with len greater than 1 maps to) because the callee might be a Fortran routine.
That would solve the problem of the callee scribbling data into the caller's
frame, but wouldn't solve the issue that the caller didn't actually place a
valid value for the missing hidden parameter.  Thoughts on that?
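
As a small illustration of why the 9th argument has no other home, the following C sketch maps an argument index to its location. It assumes the ELFv2 layout used elsewhere in this thread (arguments in r3..r10, parameter save area starting 32 bytes above the arg pointer, one doubleword per argument); `struct arg_location` and `locate_arg` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define GP_ARG_NUM_REG 8         /* r3..r10 carry the first eight arguments */
#define FIRST_ARG_GPR 3
#define PARM_SAVE_AREA_OFFSET 32 /* assumed ELFv2 offset from the arg pointer */

/* Where one doubleword argument lives on entry to the callee.  */
struct arg_location
{
  bool in_register; /* true for the first eight arguments */
  int gpr;          /* register number, or -1 if passed in memory */
  int ap_offset;    /* save-area slot reserved for this argument */
};

/* INDEX is zero-based: 0 is the first argument, 8 is the ninth.  */
static struct arg_location
locate_arg (int index)
{
  assert (index >= 0);
  struct arg_location loc;
  loc.in_register = index < GP_ARG_NUM_REG;
  loc.gpr = loc.in_register ? FIRST_ARG_GPR + index : -1;
  loc.ap_offset = PARM_SAVE_AREA_OFFSET + 8 * index;
  return loc;
}
```

In this model the ninth argument (index 8) has no register and can only be read from ap+96, which is memory the caller owns only if it really allocated the save area.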

[Bug target/100799] Stackoverflow in optimized code on PPC

2024-02-20 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #23 from Jakub Jelinek  ---
Note, given that in PR90329 a workaround has been introduced for such buggy
cases (that time to disallow functions with the DECL_HIDDEN_STRING_LENGTH
arguments from making certain tail-calls and call them normally instead), if
the PowerPC backend maintainers wanted, there could be a similar workaround on
the rs6000 backend side: in the decision whether the callee can use the
parameter save area or not, ignore DECL_HIDDEN_STRING_LENGTH PARM_DECLs when
counting, so if e.g. 9 arguments are passed but one of them is
DECL_HIDDEN_STRING_LENGTH, assume the parameter save area is not there.

[Bug target/100799] Stackoverflow in optimized code on PPC

2023-06-19 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Peter Bergner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|WAITING                     |RESOLVED

--- Comment #22 from Peter Bergner  ---
I'm closing this as NOT A BUG in GCC.  It is a bug in the source code being
compiled, which does not follow the rules for calls between Fortran and C.
Surya listed two solutions which can be used in Comment #21 below.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-11-09 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Surya Kumari Jangala  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |WAITING

--- Comment #21 from Surya Kumari Jangala  ---
There are two options to resolve the issue:

1. Use the BIND(C) directive on the fortran callee (DGEBAL) to make it
interoperable with the caller which is written in C. As described in comment
19, using this directive removed accesses to the caller's frame.

2. As described in
(https://gcc.gnu.org/onlinedocs/gfortran/Argument-passing-conventions.html),
since the first parameter to DGEBAL is of type CHARACTER, there is an extra
hidden argument. Change the call to DGEBAL from dgebal (the flexiBLAS wrapper
routine) to take an extra argument. This causes the compiler to allocate a
parameter save area in dgebal's frame, as there are now 9 parameters but only 8
parameter registers.
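
A rough C sketch of option 2 follows. It is not FlexiBLAS code: `dgebal_stub`, `dgebal_fn` and `fake_dgebal` are hypothetical names, the function pointer is passed as a parameter only to keep the example self-contained, default INTEGER is assumed to map to `int`, and `size_t` for the hidden length is an assumption about the gfortran version in use (the linked documentation describes the exact type).

```c
#include <assert.h>
#include <stddef.h>

/* Signature of the Fortran routine as seen from C, including the hidden
   length of JOB.  size_t for the length is an assumption.  */
typedef void (*dgebal_fn) (const char *job, const int *n, double *a,
                           const int *lda, int *ilo, int *ihi,
                           double *scale, int *info, size_t job_len);

/* Hypothetical stub in the style of a forwarding wrapper.  */
void
dgebal_stub (dgebal_fn real, const char *job, const int *n, double *a,
             const int *lda, int *ilo, int *ihi, double *scale, int *info)
{
  assert (real != NULL);
  /* JOB is a single character, so pass 1 as its hidden length.  */
  real (job, n, a, lda, ilo, ihi, scale, info, (size_t) 1);
}

/* Stand-in backend used only to exercise the stub; it records the hidden
   length it received instead of balancing a matrix.  */
static size_t seen_job_len;

static void
fake_dgebal (const char *job, const int *n, double *a, const int *lda,
             int *ilo, int *ihi, double *scale, int *info, size_t job_len)
{
  (void) job; (void) n; (void) a; (void) lda;
  (void) ilo; (void) ihi; (void) scale;
  seen_job_len = job_len;
  *info = 0;
}
```

Because the call through `real` now passes nine arguments, the compiler has to give the stub a parameter save area, which is the effect option 2 relies on.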

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-10-30 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Kewen Lin  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linkw at gcc dot gnu.org

--- Comment #20 from Kewen Lin  ---
(In reply to Alan Modra from comment #4)
> The disassembly says this is powerpc64le.  Possibly interesting fact: the
> offsets used above the stack frame are 400, 432, 440, which all correspond
> to the parameter save area.  I don't see any reason that DGEBAL should have
> a parameter save area though since all parameters can be passed in regs.

This also confuses me, since the function prototype

  SUBROUTINE DGEBAL( JOB, N, A, LDA, ILO, IHI, SCALE, INFO )

only has eight parameters, by looking into it the reason is that the first
parameter

  "CHARACTER JOB"

has one more hidden associated length argument.

"For arguments of CHARACTER type, the character length is passed as a hidden
argument at the end of the argument list. " as said in [1], so this function
actually has nine (more than eight) doubleword arguments, then it does need one
parameter save area.

[1] https://gcc.gnu.org/onlinedocs/gfortran/Argument-passing-conventions.html

Surya's analysis looks reasonable to me, the current stub scheme with function
pointer call in C doesn't match the Fortran side.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-10-17 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #19 from Surya Kumari Jangala  ---
There is a keyword called BIND(C) which can be specified on a Fortran procedure
to make it interoperable.
I tried this keyword on DGEBAL fortran routine which is a part of the openblas
library and it worked! I did not see any REG_EQUIV notes after the expand pass,
and the final assembly did not have accesses to the caller's frame.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-10-17 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #18 from Surya Kumari Jangala  ---
I git cloned and built flexiblas to see what is the frame size and what is the
assembly code generated for the flexiblas C wrapper routine for dgebal.

The important assembly code snippets for dgebal.c :

// r23-r31 are saved in the callee frame
   std r23,-72(r1)
   std r24,-64(r1)
   ...
   ...
   std r31,-8(r1)

// allocate the stack frame
   stdu r1,-112(r1)

// save the parameter registers r3-r10 into r23-r30
   mr  r30,r3
   ...
   mr  r23,r10

// some of the param regs are used as temps
   ld  r3,0(r31)
   lwz r11,16(r3)

// populate the param registers appropriately
   mr  r3,r30
   ...
   mr  r10,r23

// make the call to the fortran dgebal routine
   bctrl

// restore r1
   addi r1,r1,112

// restore r23-r31
   ld  r23,-72(r1)
   ...
   ld  r31,-8(r1)

// return
   blr

As we can see, the frame size allocated is only 112 out of which 32 is for
things like LR, TOC etc. and 72 is needed to save r23-r31. So clearly, the
wrapper routine is not allocating any parameter save area in its frame.
Now, the dgebal fortran routine writes into the caller's frame thereby
corrupting a callee save register (one of r23-r31). So when control returns
back from the wrapper routine to the fortran routine dgeev, we see a corrupted
value.
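
The frame-size arithmetic can be checked with a short C helper. This is a simplified model, not GCC's frame layout code: it assumes a 32-byte frame header, 8 bytes per saved GPR, a 64-byte parameter save area when one is present, 16-byte stack alignment, and no locals or FPR/vector saves. `frame_bytes` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

#define FRAME_HEADER_BYTES 32   /* back chain, CR, LR and TOC save slots */
#define PARM_SAVE_AREA_BYTES 64 /* eight doublewords, one per argument GPR */
#define STACK_ALIGN 16

/* Size of a frame that saves SAVED_GPRS callee-saved registers, with or
   without a parameter save area, rounded up to the stack alignment.  */
static unsigned
frame_bytes (unsigned saved_gprs, bool parm_save_area)
{
  assert (saved_gprs <= 18); /* r14..r31 are the callee-saved GPRs */
  unsigned bytes = FRAME_HEADER_BYTES + 8 * saved_gprs;
  if (parm_save_area)
    bytes += PARM_SAVE_AREA_BYTES;
  return (bytes + STACK_ALIGN - 1) & ~(unsigned) (STACK_ALIGN - 1);
}
```

With nine saved registers (r23-r31) and no save area the model gives 32 + 72 = 104, rounded up to the 112 seen in the disassembly; a frame that did include a save area would have to be larger under the same assumptions.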

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-10-17 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #17 from Surya Kumari Jangala  ---
I analysed the reduced test case specified in comment 15. In the .s file, the
callee decrements r1 by 224, ie, callee’s frame size is 224. But there is an
instruction in the callee that accesses into the caller’s frame at (r1+272).
At first glance this looks odd, even incorrect, but after further analysis, I
am not sure if this is incorrect.
If we look at the RTL dumps, the offset 272 is introduced in ‘reload’. ‘Insn 4’
stores into (r1+272). 

‘Insn 4’ after vregs:

(insn 4 3 5 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
(reg:DI 5 5 [ arrayD.2714 ])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))


‘Insn 4’ after IRA:

(insn 4 214 237 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
(reg:DI 262)) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_DEAD (reg:DI 262)
(expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil))))

‘Insn 4’ after reload:

(insn 4 214 19 2 (set (mem/f/c:DI (plus:DI (reg/f:DI 1 1)
(const_int 272 [0x110])) [3 arrayD.2714+0 S8 A64])
(reg:DI 5 5 [262])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))


As we can see, during vregs phase, we are moving r5 to r177 and r177 is equiv
to (ap+48). ‘ap’ (r99) is the base register for access to arguments of the
function.

In the gcc code:
#define ARG_POINTER_REGNUM 99

During vregs phase, not just r5, but all registers from r3-r10 are moved to
pseudo registers and these pseudo regs are equivalent to (ap+’offset’) with
‘offset’ starting from 32 for r3 and going on till 88 for r10. Note that ap
points to the beginning of the callee frame, hence to access the parameter save
area of the caller’s frame, 32 needs to be added to ap.

During LRA, in curr_insn_transform(), we make equivalence substitution and
change r177 to r1+272. (272 because r177 is equivalent to ap+48, and ap equals
r1+224, so ap+48 = r1+272). 
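
The offset arithmetic in the last few paragraphs can be reproduced with two tiny C helpers. The function names are hypothetical; the constants (32 for the start of the save area, 8 bytes per register, r3..r10 as argument registers) are the ones stated in the text.

```c
#include <assert.h>

#define FIRST_ARG_GPR 3
#define LAST_ARG_GPR 10
#define PARM_SAVE_AREA_OFFSET 32

/* Offset from ap of the save-area slot that belongs to argument
   register GPR.  */
static int
gpr_home_offset (int gpr)
{
  assert (gpr >= FIRST_ARG_GPR && gpr <= LAST_ARG_GPR);
  return PARM_SAVE_AREA_OFFSET + 8 * (gpr - FIRST_ARG_GPR);
}

/* After the callee allocates FRAME_BYTES, ap equals r1 + FRAME_BYTES, so
   an ap-relative offset becomes this r1-relative offset.  */
static int
ap_to_r1_offset (int frame_bytes, int ap_offset)
{
  return frame_bytes + ap_offset;
}
```

For r5 and a 224-byte frame this gives 224 + 48 = 272, the offset seen in insn 4 after reload.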

The argument registers r3-r10 are saved as they need to be reused to pass
parameters to functions called from the callee. But not all parameter registers
are spilled to the stack. For example, r6 is saved in r24. We can see this
after the “final” phase:

(insn 5 289 19 (set (reg/v/f:DI 24 %r24 [orig:178 ldaD.2715 ] [178])
(reg:DI 6 %r6 [263])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 56 [0x38])) [6 ldaD.2715+0 S8 A64])
(nil)))

I guess r5 had to be spilled to stack because there were no free registers.

Also, note that there is a load from (r1+272) in the reduced test case. This
shows that the value in r5 is needed, and hence it has to be saved somewhere.

I ran the test case with the options: -mcpu=power8 -O2 -fPIC

If -fPIC option is removed, we do not see any access to the caller’s frame in
the generated assembly. But it does have instructions that save the parameter
registers into other registers. I suppose the parameter registers did not have
to be saved on stack (ie, in the caller’s parameter save area) because there
were enough registers available. That is, perhaps there is less register
pressure without -fPIC.

After vregs:
(insn 4 3 5 2 (set (reg/v/f:DI 177 [ arrayD.2714 ])
(reg:DI 5 %r5 [ arrayD.2714 ])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))

After reload:
(insn 4 214 19 2 (set (reg/v/f:DI 17 %r17 [orig:177 arrayD.2714 ] [177])
(reg:DI 5 %r5 [262])) "bug.f":1:23 675 {*movdi_internal64}
 (expr_list:REG_EQUIV (mem/f/c:DI (plus:DI (reg/f:DI 99 ap)
(const_int 48 [0x30])) [3 arrayD.2714+0 S8 A64])
(nil)))


To summarise, the reduced testcase seems to be correctly compiled. So I shifted
my focus to the original fortran file dgebal.f in the openBLAS library.


In dgebal.f too we have some instructions accessing the caller’s parameter save
area. These are the interesting snippets of instructions from the assembly
code: 

   // The original contents of r23 are spilled.
std %r23,-192(%r1)
   // r3 is saved in r23
mr %r23,%r3
   // frame is allocated
stdu %r1,-400(%r1)

   // restore r3 contents before making call to lsame_. There are several
   // calls to lsame_ and each time, r3 is restored.
mr %r3,%r23
bl lsame_

   // save r23 to the stack because we are running out of registers and we
   // need a free reg. Note that we are saving to the caller’s frame into the
   // parameter save area. And we are saving to (400+32) which is the
   // location that r3 would have been
[Bug target/100799] Stackoverflow in optimized code on PPC

2022-09-20 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #16 from Segher Boessenkool  ---
It cannot be -mcpu=power8, that cannot generate isel.  -mcpu=power9 comes
closer, but I still do not see exactly the same output, and crucially not
the strange store either.

What the what.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-09-18 Thread jskumari at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #15 from Surya Kumari Jangala  ---
(In reply to Segher Boessenkool from comment #14)
> What is the exact command line (and relevant configuration!) required to
> reproduce this?

The reduced testcase is:

      SUBROUTINE DGEBAL( JOB, N, ARRAY, LDA, ILO, IHI, SCALE, INFO )
      CHARACTER          JOB
      DOUBLE PRECISION   ARRAY( LDA, * ), SCALE( * )
      LOGICAL            NOCONV
  140 CONTINUE
      DO 200 I = K, L
         C = DNRM2( L-K+1, ARRAY( K, I ), 1 )
         R = DNRM2( L-K+1, ARRAY( I, K ), LDA )
         ICA = IDAMAX( L, ARRAY( 1, I ), 1 )
         CA = ABS( ARRAY( ICA, I ) )
         IF( C.EQ.ZERO .OR. R.EQ.ZERO )
     $      GO TO 200
         IF( G.LT.R .OR. MAX( R, RA ).GE.SFMAX2 .OR.
     $       MIN( F, C, G, CA ).LE.SFMIN2 )GO TO 190
         F = F / SCLFAC
         G = G / SCLFAC
  190    CONTINUE
         CALL DSCAL( N-K+1, G, ARRAY( I, K ), LDA )
  200 CONTINUE
      IF( NOCONV )
     $   GO TO 140
      END


The options to use to reproduce: -mcpu=power8 -O2 -fPIC

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-09-13 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #14 from Segher Boessenkool  ---
What is the exact command line (and relevant configuration!) required to
reproduce this?

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-20 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #13 from Segher Boessenkool  ---
(In reply to Alexander Grund from comment #11)
> Some more experiments with GCC 10.3, OpenBLAS 0.3.15 and FlexiBLAS 3.0.4:
> 
> Baseline: Broken at -O1, working at -Og
> 
> I got it to break with "-Og -fmove-loop-invariants".
> Then it worked again by adding "-fstack-protector-all".

Both are great info!

> But that is
> seemingly not advisable:
> https://developers.redhat.com/blog/2020/05/22/stack-clash-mitigation-in-gcc-
> part-3

-fstack-protector-strong is cheap enough that you can (and perhaps should)
enable it almost always.  Some distributions do this even?

-fstack-check= is an Ada thing.  -fstack-clash-protection is a different thing
as well (that's what that article is about).

Enabling ssp is not a great workaround of course, it is much too roundabout;
and I suspect the only reason it works is because it changes the stack layout.
Still, useful info, thanks :-)

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-20 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #12 from Segher Boessenkool  ---
(In reply to Alexander Grund from comment #10)
> (In reply to Peter Bergner from comment #2)
> > The failure with GCC 7 and later coincides with the PPC port starting to
> > default to LRA instead of reload.
> 
> Is there a compiler flag that can switch the default back as a workaround?

No, the PowerPC GCC port only supports LRA since g:7a5cbf29beb2 (from 2017).

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-20 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #11 from Alexander Grund  ---
Some more experiments with GCC 10.3, OpenBLAS 0.3.15 and FlexiBLAS 3.0.4:

Baseline: Broken at -O1, working at -Og

I got it to break with "-Og -fmove-loop-invariants".
Then it worked again by adding "-fstack-protector-all". But that is seemingly
not advisable:
https://developers.redhat.com/blog/2020/05/22/stack-clash-mitigation-in-gcc-part-3

Hence the current workaround is to use "-O2 -fno-move-loop-invariants"

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-20 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #10 from Alexander Grund  ---
(In reply to Peter Bergner from comment #2)
> The failure with GCC 7 and later coincides with the PPC port starting to
> default to LRA instead of reload.

Is there a compiler flag that can switch the default back as a workaround?

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-14 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Peter Bergner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|bergner at gcc dot gnu.org  |jskumari at gcc dot gnu.org

--- Comment #9 from Peter Bergner  ---
(In reply to Peter Bergner from comment #8)
> I'm sorry, this is still on my TODO to debug.  I have worked on this, but
> got side tracked on other things.  I'll try and refresh myself with where I
> was at and continue working this.

Actually, Surya from my team will take over looking at this.  Reassigning the
bug to her.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-08 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Peter Bergner  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |bergner at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #8 from Peter Bergner  ---
(In reply to Alexander Grund from comment #7)
> Hi, it's more than 1 year later now. Peter seemingly has a simple reproducer.
> Is there anything new on this? Any patch to fix that or at least anything to
> try or a workaround like disabling a specific optimization causing this?

I'm sorry, this is still on my TODO to debug.  I have worked on this, but got
sidetracked on other things.  I'll try and refresh myself with where I was at
and continue working on this.

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-07-08 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #7 from Alexander Grund  ---
Hi,
it's more than 1 year later now. Peter seemingly has a simple reproducer.
Is there anything new on this? Any patch to fix that or at least anything to
try or a workaround like disabling a specific optimization causing this?

Best Regards

[Bug target/100799] Stackoverflow in optimized code on PPC

2022-01-09 Thread kenneth.hoste at ugent dot be via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #6 from Kenneth Hoste  ---
(In reply to Segher Boessenkool from comment #3)
> Hi Alexander,
> 
> You do not say what the actual target you used is?  powerpc-linux,
> powerpc64-linux, powerpc64le-linux, something else entirely?

We're definitely seeing this on ppc64le, see also
https://github.com/mpimd-csc/flexiblas/issues/17 and
https://github.com/easybuilders/easybuild-easyconfigs/issues/12968 for
additional context.

[Bug target/100799] Stackoverflow in optimized code on PPC

2021-10-05 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #5 from Peter Bergner  ---
So I took dgebal.f and ran delta on it to try and reduce it to something
manageable (I wish creduce worked on Fortran files!) and got the following,
which still shows us accessing memory above the stack frame.

      SUBROUTINE DGEBAL( JOB, N, A, LDA, ILO, IHI, SCALE, INFO )
      CHARACTER          JOB
      DOUBLE PRECISION   A( LDA, * ), SCALE( * )
      LOGICAL            NOCONV
  140 CONTINUE
      DO 200 I = K, L
         C = DNRM2( L-K+1, A( K, I ), 1 )
         R = DNRM2( L-K+1, A( I, K ), LDA )
         ICA = IDAMAX( L, A( 1, I ), 1 )
         CA = ABS( A( ICA, I ) )
         IF( C.EQ.ZERO .OR. R.EQ.ZERO )
     $      GO TO 200
         IF( G.LT.R .OR. MAX( R, RA ).GE.SFMAX2 .OR.
     $       MIN( F, C, G, CA ).LE.SFMIN2 )GO TO 190
         F = F / SCLFAC
         G = G / SCLFAC
  190    CONTINUE
         IF( ( C+R ).GE.FACTOR*S )
     $      GO TO 200
         IF( F.LT.ONE .AND. SCALE( I ).LT.ONE ) THEN
         END IF
         CALL DSCAL( N-K+1, G, A( I, K ), LDA )
  200 CONTINUE
      IF( NOCONV )
     $   GO TO 140
      END

This isn't related to some strange Fortran parameter passing rules (i.e., all
params are passed by reference), is it?


dgebal_:
.LFB0:
.cfi_startproc
.LCF0:
0:  addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
.localentry dgebal_,.-dgebal_
std 24,-88(1)
.cfi_offset 24, -88
lwa 24,0(6)
mflr 0
mfcr 11,8
std 20,-120(1)
std 15,-160(1)
std 16,-152(1)
std 17,-144(1)
std 19,-128(1)
std 21,-112(1)
std 22,-104(1)
std 23,-96(1)
std 25,-80(1)
std 27,-64(1)
std 28,-56(1)
stw 11,8(1)
li 9,0
.cfi_register 65, 0
.cfi_offset 20, -120
.cfi_offset 15, -160
.cfi_offset 16, -152
.cfi_offset 17, -144
.cfi_offset 19, -128
.cfi_offset 21, -112
.cfi_offset 22, -104
.cfi_offset 23, -96
.cfi_offset 25, -80
.cfi_offset 27, -64
.cfi_offset 28, -56
.cfi_offset 72, 8
addis 27,2,.LANCHOR0@toc@ha
stfd 29,-24(1)
stfd 30,-16(1)
stfd 31,-8(1)
std 14,-168(1)
std 18,-136(1)
std 26,-72(1)
std 29,-48(1)
cmpdi 0,24,0
std 0,16(1)
std 30,-40(1)
std 31,-32(1)
stdu 1,-224(1)
.cfi_def_cfa_offset 224
.cfi_offset 61, -24
.cfi_offset 62, -16
.cfi_offset 63, -8
.cfi_offset 14, -168
.cfi_offset 18, -136
.cfi_offset 26, -72
.cfi_offset 29, -48
.cfi_offset 65, 16
.cfi_offset 30, -40
.cfi_offset 31, -32
addi 27,27,.LANCHOR0@toc@l
li 21,-8
mr 25,6
isel 24,0,24,0
mr 16,4
cmpwi 4,9,0
addi 28,1,32
addi 22,1,36
addi 15,1,40
std 5,272(1)   # 272 is bigger than 224!
...
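
The check behind the "272 is bigger than 224" remark can be done mechanically.
What follows is a minimal Python sketch, not something from the original
report: the function name `accesses_above_frame` and the regular expression
are illustrative assumptions. It only recognises plain `disp(1)` accesses and
takes the frame size from a `stdu 1,-N(1)` instruction. A hit means the access
reaches into the caller's frame, which is only a bug if the caller did not
allocate that space.

```python
import re

# Matches accesses such as "std 5,272(1)": mnemonic, register, displacement, base register.
MEM_ACCESS = re.compile(r"^\s*(\w+)\s+(\d+),(-?\d+)\((\d+)\)")

def accesses_above_frame(asm_lines):
    """Return (instruction, displacement) pairs that address memory at or
    beyond the frame size, relative to r1, once the frame has been allocated."""
    frame_size = None
    hits = []
    for line in asm_lines:
        code = line.split("#", 1)[0].strip()   # drop trailing comments
        m = MEM_ACCESS.match(code)
        if not m:
            continue                           # labels, directives, non-memory ops
        mnemonic, reg, disp, base = m.group(1), m.group(2), int(m.group(3)), m.group(4)
        if base != "1":
            continue                           # only stack-pointer-relative accesses
        if mnemonic == "stdu" and reg == "1" and disp < 0:
            frame_size = -disp                 # "stdu 1,-224(1)" allocates 224 bytes
            continue
        if frame_size is not None and disp >= frame_size:
            hits.append((code, disp))
    return hits

# A few lines taken from the listing above.
sample = [
    "std 0,16(1)",
    "stdu 1,-224(1)",
    "addi 28,1,32",
    "std 5,272(1)   # 272 is bigger than 224!",
]
print(accesses_above_frame(sample))
```

On the sample lines this reports only `std 5,272(1)`. The earlier `std 0,16(1)`
is ignored because it executes before the frame is allocated.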

[Bug target/100799] Stackoverflow in optimized code on PPC

2021-06-01 Thread amodra at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Alan Modra  changed:

   What|Removed |Added

 Target|powerpc |powerpc64le
 CC||amodra at gmail dot com

--- Comment #4 from Alan Modra  ---
The disassembly says this is powerpc64le.  Possibly interesting fact: the
offsets used above the stack frame are 400, 432, 440, which all correspond to
the parameter save area.  I don't see any reason that DGEBAL should have a
parameter save area though since all parameters can be passed in regs.
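
The arithmetic behind this observation can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not code from GCC:
it assumes the ELFv2 layout (a 32-byte frame header followed by 8-byte
parameter save slots, the first eight of which are the homes of r3..r10), and
the helper name `caller_slot` is hypothetical. The 368-byte frame size is the
one quoted in comment #2, and the 224/272 pair comes from the reduced test
case in comment #5.

```python
HEADER_SIZE = 32      # ELFv2 frame header: back chain, CR save, LR save, TOC save
SLOT_SIZE = 8         # one doubleword per parameter save slot
GP_ARG_NUM_REG = 8    # arguments passed in r3..r10
FIRST_ARG_REG = 3

def caller_slot(offset, frame_size):
    """Describe where offset(r1) lands in the caller's frame, or return None
    if the offset is still inside the callee's own frame."""
    above = offset - frame_size
    if above < 0:
        return None
    if above < HEADER_SIZE:
        return "caller frame header"
    slot = (above - HEADER_SIZE) // SLOT_SIZE
    if slot < GP_ARG_NUM_REG:
        return f"parameter save slot {slot} (home of r{FIRST_ARG_REG + slot})"
    return f"parameter save slot {slot} (stack-passed argument)"

for frame_size, offsets in ((368, (400, 432, 440)), (224, (272,))):
    for offset in offsets:
        print(frame_size, offset, caller_slot(offset, frame_size))
```

Under these assumptions 400, 432 and 440 map to the home slots of r3, r7 and
r8, and 272 in the reduced test case maps to the home slot of r5, which
matches the `std 5,272(1)` instruction in comment #5.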

[Bug target/100799] Stackoverflow in optimized code on PPC

2021-06-01 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #3 from Segher Boessenkool  ---
Hi Alexander,

You do not say what the actual target you used is?  powerpc-linux,
powerpc64-linux, powerpc64le-linux, something else entirely?

[Bug target/100799] Stackoverflow in optimized code on PPC

2021-06-01 Thread bergner at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

Peter Bergner  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-06-01

--- Comment #2 from Peter Bergner  ---
(In reply to Alexander Grund from comment #1)
> Confirmed to also break with GCC 7.3, 8.2, 8.3 but works with 6.3, 6.4, 6.5

The failure with GCC 7 and later coincides with the PPC port starting to
default to LRA instead of reload.  If I look at the debug dumps compiling
dgebal.f, the 440 offset to the stack is created by an LRA spill.  No problem
there that I can see.  The problem seems to come later when we generate the
prologue/epilogue and we only update the stack pointer by the smaller 368 byte
offset.

Either LRA isn't telling us it needs that extra stack space or the ppc backend
didn't notice.  I'll keep digging.

[Bug target/100799] Stackoverflow in optimized code on PPC

2021-05-28 Thread alexander.grund--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799

--- Comment #1 from Alexander Grund  ---
Confirmed to also break with GCC 7.3, 8.2, 8.3 but works with 6.3, 6.4, 6.5