Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

2016-01-20 Thread Alexander Monakov
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > You mean you have already implemented something along the lines I
> > proposed?
> 
> Yes, I was implementing OpenMP teams, and it made sense to add warps-per-block
> limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if the
> default or requested number of threads per team is too high).  I intend to
> post that patch as part of a larger series shortly (but the patch itself is
> simple enough, although a small tweak will be needed to make it apply to
> OpenACC too).

Here's the patch I was talking about:
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=04e68c22081c36caf5da9d9f4ca5e895e1088c78;hp=735c8a7d88a7e14cb707f22286678982174175a6

Alexander


Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

2016-01-19 Thread Alexander Monakov
On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
> 
> With nvptx offloading, in one OpenACC test case, we're running into the
> following fatal error (GOMP_DEBUG=1 output):
> 
> [...]
> info: Function properties for 'LBM_performStreamCollide$_omp_fn$0':
> info: used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
> [...]
>   nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
> 
> libgomp: cuLaunchKernel error: too many resources requested for launch
> 
> Very likely this means that the number of registers used in this function
> ("used 87 registers"), multiplied by the thread block size (workers *
> vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

Yes, today most CUDA GPUs allow 64K registers per block, some allow 32K, so
87*32*32 = 89088 definitely overflows either limit.  A reference is available
in the CUDA C Programming Guide, appendix G, table 13:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
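
For concreteness, that check can be sketched against the driver API roughly
as follows (an illustration of the arithmetic only, not what the plugin does;
it also ignores the granularity at which the hardware allocates registers):

  #include <cuda.h>

  /* Return nonzero if launching FN with THREADS threads per block fits
     the device's per-block register budget.  For the failing case above,
     87 * (32 * 32) = 89088 exceeds the 65536-register limit.  */
  static int
  launch_fits_reg_budget (CUfunction fn, CUdevice dev, int threads)
  {
    int regs_per_thread, regs_per_block;
    cuFuncGetAttribute (&regs_per_thread, CU_FUNC_ATTRIBUTE_NUM_REGS, fn);
    cuDeviceGetAttribute (&regs_per_block,
                          CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK, dev);
    return regs_per_thread * threads <= regs_per_block;
  }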
 
> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

(well, if you want I can point out that
 1) GCC never emits launch bounds, so the PTX JIT has to guess the limits --
 that's something I'd like to play with in the future, time permitting
 2) OpenACC register copying at forks increases (pseudo-)register pressure
 3) I think if you inspect the PTX code you'll see it uses way more than
 87 regs)

As for the proposed patch, does the OpenACC spec leave the implementation the
freedom to spawn a different number of workers than requested?  (An honest
question -- I didn't look at the spec that closely.)

> Alternatively/additionally, we could try experimenting with using the
> following of enum CUjit_option "Online compiler and linker options":
[snip]
> ..., to have the PTX JIT reduce the number of live registers (if
> possible; I don't know), and/or could try experimenting with querying the
> active device, enum CUdevice_attribute "Device properties":
> 
> [...]
> CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
> Maximum number of 32-bit registers available per block 
> [...]
> 
> ..., and use that in combination with each function's enum
> CUfunction_attribute "Function properties":
[snip]
> ... to determine an optimal number of threads per block given the number
> of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> would do that already?).

I have implemented that for OpenMP offloading; note also that since CUDA 6.0
there's the cuOcc* (occupancy query) interface, which allows one to simply
ask the driver about the per-function launch limit.
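
For illustration, the attribute-based limiting I implemented can be sketched
roughly like this (names are illustrative, not the actual patch):

  #include <cuda.h>

  /* Clamp the requested threads-per-block count to what the driver
     reports FN can actually be launched with.  */
  static int
  clamp_threads_per_block (CUfunction fn, int requested)
  {
    int max_threads;
    cuFuncGetAttribute (&max_threads,
                        CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, fn);
    return requested > max_threads ? max_threads : requested;
  }

(The CUjit_option route quoted above would instead pass CU_JIT_MAX_REGISTERS
to cuModuleLoadDataEx at JIT time, trading registers for spills.)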

Thanks.
Alexander


Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

2016-01-19 Thread Nathan Sidwell

On 01/19/16 06:49, Thomas Schwinge wrote:


> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

My understanding is that the JIT compiler does register allocation.


>    int axis = get_oacc_ifn_dim_arg (call);
> +  if (axis == GOMP_DIM_WORKER)
> +    {
> +      /* libgomp's nvptx plugin might potentially modify
> +         dims[GOMP_DIM_WORKER].  */
> +      return NULL_TREE;
> +    }

this is almost certainly wrong.  You're preventing constant folding in the
compiler.


nathan


Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

2016-01-19 Thread Alexander Monakov
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > ... to determine an optimal number of threads per block given the number
> > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > would do that already?).
> 
> I have implemented that for OpenMP offloading; note also that since CUDA 6.0
> there's the cuOcc* (occupancy query) interface, which allows one to simply
> ask the driver about the per-function launch limit.

Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
indeed sufficient for limiting threads per block, which is trivially
translatable into workers per gang in OpenACC.  IMO it's also a cleaner
approach in this case, compared to iterative backoff (if, again, the
implementation is free to do that).

When mentioning cuOcc* I was thinking about finding an optimal number of
blocks per device, which is a different story.
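
For reference, that second use can be sketched roughly like this (error
checking omitted; no dynamic shared memory assumed):

  #include <cuda.h>

  /* Ask the driver for the block size and the minimal grid size that
     achieve maximal occupancy for FN.  */
  static void
  query_occupancy (CUfunction fn, int *min_grid_size, int *block_size)
  {
    cuOccupancyMaxPotentialBlockSize (min_grid_size, block_size, fn,
                                      NULL /* no dyn. smem callback */,
                                      0 /* dyn. smem bytes */,
                                      0 /* no block size limit */);
  }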

Alexander


Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

2016-01-19 Thread Alexander Monakov
On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
> 
> On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov wrote:
> > On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > > ... to determine an optimal number of threads per block given the number
> > > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > > would do that already?).
> > > 
> > > I have implemented that for OpenMP offloading; note also that since CUDA
> > > 6.0 there's the cuOcc* (occupancy query) interface, which allows one to
> > > simply ask the driver about the per-function launch limit.
> 
> You mean you have already implemented something along the lines I
> proposed?

Yes, I was implementing OpenMP teams, and it made sense to add warps-per-block
limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_... and limit if the
default or requested number of threads per team is too high).  I intend to
post that patch as part of a larger series shortly (but the patch itself is
simple enough, although a small tweak will be needed to make it apply to
OpenACC too).

Alexander


Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

2016-01-19 Thread Thomas Schwinge
Hi!

On Tue, 19 Jan 2016 08:47:02 -0500, Nathan Sidwell wrote:
> On 01/19/16 06:49, Thomas Schwinge wrote:
> >    int axis = get_oacc_ifn_dim_arg (call);
> > +  if (axis == GOMP_DIM_WORKER)
> > +    {
> > +      /* libgomp's nvptx plugin might potentially modify
> > +         dims[GOMP_DIM_WORKER].  */
> > +      return NULL_TREE;
> > +    }
> 
> this is almost certainly wrong.  You're preventing constant folding in the
> compiler.

Yes, because if libgomp can modify dims[GOMP_DIM_WORKER], in the compiler
we cannot assume it to be constant?  (It did result in a run-time test
verification failure.)  Of course, my hammer might be too big a one
(which is why this is an RFC).


Regards
 Thomas


Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

2016-01-19 Thread Thomas Schwinge
Hi!

On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov wrote:
> On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > ... to determine an optimal number of threads per block given the number
> > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > would do that already?).
> > 
> > I have implemented that for OpenMP offloading; note also that since CUDA 6.0
> > there's the cuOcc* (occupancy query) interface, which allows one to simply
> > ask the driver about the per-function launch limit.

You mean you have already implemented something along the lines I
proposed?

> Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
> indeed sufficient for limiting threads per block, which is trivially
> translatable into workers per gang in OpenACC.

That's good to know, thanks!

> IMO it's also a cleaner
> approach in this case, compared to iterative backoff (if, again, the
> implementation is free to do that).

It is not explicitly spelled out in OpenACC 2.0a, but it was clarified in
OpenACC 2.5.  See "2.5.7. num_workers clause": "[...]  The implementation
may use a different value than specified based on limitations imposed by
the target architecture".
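
Given that freedom, the clamping can presumably look roughly like the
following sketch (not actual plugin code; `vectors' is the vector length,
32 on this hardware):

  #include <cuda.h>

  /* Derive the effective workers-per-gang count from the driver's
     per-function thread limit.  */
  static int
  limit_workers (CUfunction fn, int requested_workers, int vectors)
  {
    int max_threads, max_workers;
    cuFuncGetAttribute (&max_threads,
                        CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, fn);
    max_workers = max_threads / vectors;
    return requested_workers > max_workers
           ? max_workers : requested_workers;
  }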

> When mentioning cuOcc* I was thinking about finding an optimal number of
> blocks per device, which is a different story.

:-)


Regards
 Thomas