Re: nvptx: '-mframe-malloc-threshold', '-Wframe-malloc-threshold' (was: Handling of large stack objects in GPU code generation -- maybe transform into heap allocation?)

Jerry D via Gcc-patches Fri, 23 Dec 2022 13:23:57 -0800

On 12/23/22 6:08 AM, Thomas Schwinge wrote:

Hi!


On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fort...@gcc.gnu.org> wrote:

On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <tho...@codesourcery.com> wrote:

For example, for Fortran code like:

     write (*,*) "Hello world"

..., 'gfortran' creates:

     struct __st_parameter_dt dt_parm.0;

     try
       {
         dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
         dt_parm.0.common.line = 29;
         dt_parm.0.common.flags = 128;
         dt_parm.0.common.unit = 6;
         _gfortran_st_write (&dt_parm.0);
         _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
         _gfortran_st_write_done (&dt_parm.0);
       }
     finally
       {
         dt_parm.0 = {CLOBBER(eol)};
       }

The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
really! -- there's a lot of state in Fortran I/O apparently).  That's a
problem for GPU execution -- here: OpenACC/nvptx -- where typically you
have small stacks.  (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
"Use custom stacks instead of local memory for automatic storage".)
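To make the idea from the subject line concrete, here is a hand-written C analogue of turning a large stack object into a heap allocation. It is only a sketch: 'struct big_state', 'consume', 'on_stack' and 'on_heap' are hypothetical names, the 512-byte size merely mirrors the "half-KiB" figure above, and this is not what the compiler or any proposed option actually emits.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a large per-call state block such as
   'struct __st_parameter_dt'; 512 bytes is only illustrative.  */
struct big_state { char bytes[512]; };

/* Hypothetical consumer, standing in for the library calls that
   receive a pointer to the state block.  */
static int consume(struct big_state *s)
{
    return s->bytes[0] + s->bytes[511];
}

/* As written by the user: the frame grows by sizeof (struct big_state).  */
int on_stack(void)
{
    struct big_state s;
    memset(&s, 0, sizeof s);
    s.bytes[0] = 1;
    return consume(&s);
}

/* Same logic with the object moved to the heap: the frame only holds
   a pointer, at the cost of a malloc/free pair per call.  */
int on_heap(void)
{
    struct big_state *s = malloc(sizeof *s);
    if (s == NULL)
        return -1;              /* allocation failure must be handled */
    memset(s, 0, sizeof *s);
    s->bytes[0] = 1;
    int result = consume(s);
    free(s);
    return result;
}
```

Both functions return the same value; the difference is only where the 512 bytes live, and that the heap variant has a failure path the stack variant does not.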

Now, the Nvidia Driver tries to accommodate such largish stack usage,
and dynamically increases the per-thread stack as necessary (thereby
potentially reducing parallelism) -- if it manages to understand the call
graph.  In case of libgfortran I/O, it evidently doesn't.  Not being able
to disprove existence of recursion is the common problem, as I've read.
At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:

     warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be statically determined

That's still not an actual problem if the GPU kernel's stack usage still
fits into 1 KiB.  Very often it does, but if, as happens in libgfortran
I/O handling, there is another such 'dt_parm' put onto the stack, the
stack then overflows; device-side SIGSEGV.

(There is, by the way, some similar analysis by Tom de Vries in
<https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
Recursive tests may fail due to thread stack limit".)

Of course, you shouldn't really be doing I/O in GPU kernels, but people
do like their occasional "'printf' debugging", so we ought to make that
work (... without pessimizing any "normal" code).

I assume that generally reducing the size of 'dt_parm' etc. is out of
scope.

There are so many wiggles and turns and corner cases and similar nightmares in I/O that I would advise against trying to reduce the dt_parm. It could probably be done.

For debugging GPU code, would it not be better to have a way to signal back to a main thread to do a print from there, like some sort of callback in the user's code under test?
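A rough sketch of that kind of approach in C with OpenACC, under loud assumptions: this is a deferred log rather than a true callback (the offloaded loop only stores records, and the host prints them after the region ends), and 'LOG_CAPACITY', 'struct log_entry', 'compute_with_log' and 'print_log' are made-up names for illustration. Without '-fopenacc' the pragma is ignored and the loop simply runs on the host.

```c
#include <stdio.h>

#define LOG_CAPACITY 64   /* hypothetical fixed number of log slots */

/* Hypothetical debug record; the fields are whatever you would
   otherwise have printed from inside the kernel.  */
struct log_entry { int iteration; double value; };

/* The offloaded loop does no I/O: each iteration fills its own slot,
   so no synchronization is needed.  Returns the number of records.  */
int compute_with_log(int n, struct log_entry *log)
{
    if (n > LOG_CAPACITY)
        n = LOG_CAPACITY;       /* never write past the buffer */

    #pragma acc parallel loop copyout(log[0:n])
    for (int i = 0; i < n; i++) {
        log[i].iteration = i;
        log[i].value = 0.5 * i; /* placeholder for the real computation */
    }
    return n;
}

/* Runs on the host, after the offloaded region has finished.  */
void print_log(FILE *out, const struct log_entry *log, int count)
{
    for (int i = 0; i < count; i++)
        fprintf(out, "iter %d: value %g\n", log[i].iteration, log[i].value);
}
```

A caller would declare 'struct log_entry log[LOG_CAPACITY];', call 'compute_with_log (n, log)', and pass the returned count to 'print_log (stderr, log, count)'. The trade-off is that nothing is printed if the kernel crashes before the copy-out, which is exactly the case where in-kernel printing is most tempting.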

Putting this another way: recommend that users use a different debugging method than embedding print statements, rather than do a ton of work to enable something that is not really a legitimate use case.


FWIW,

Jerry
