Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > You mean you already have implemented something along the lines I
> > proposed?
>
> Yes, I was implementing OpenMP teams, and it made sense to add
> warps-per-block limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_...
> and limit if the default or requested number of threads per team is too
> high).  I intend to post that patch as part of a larger series shortly (but
> the patch itself is simple enough, although a small tweak will be needed to
> make it apply to OpenACC too).

Here's the patch I was talking about:

https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=04e68c22081c36caf5da9d9f4ca5e895e1088c78;hp=735c8a7d88a7e14cb707f22286678982174175a6

Alexander
Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
On Tue, 19 Jan 2016, Thomas Schwinge wrote:
> Hi!
>
> With nvptx offloading, in one OpenACC test case, we're running into the
> following fatal error (GOMP_DEBUG=1 output):
>
>     [...]
>     info: Function properties for 'LBM_performStreamCollide$_omp_fn$0':
>     info: used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0],
>     80 bytes cmem[2], 0 bytes lmem
>     [...]
>     nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch
>     gangs=32, workers=32, vectors=32
>
>     libgomp: cuLaunchKernel error: too many resources requested for launch
>
> Very likely this means that the number of registers used in this function
> ("used 87 registers"), multiplied by the thread block size (workers *
> vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

Yes.  Today most CUDA GPUs allow 64K registers per block, and some allow
32K, so 87 * (32 * 32) = 89088 registers definitely overflows that limit.
A reference is available in the CUDA C Programming Guide, appendix G,
table 13:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

(Well, if you want, I can point out that 1) GCC never emits launch bounds,
so the PTX JIT has to guess limits -- that's something I'd like to play
with in the future, time permitting; 2) OpenACC register copying at forks
increases (pseudo-)register pressure; 3) I think if you inspect the PTX
code, you'll see it uses way more than 87 registers.)

As for the proposed patch: does the OpenACC spec leave the implementation
the freedom to spawn a different number of workers than requested?
(Honest question -- I didn't look at the spec that closely.)

> Alternatively/additionally, we could try experimenting with using the
> following of enum CUjit_option "Online compiler and linker options":
[snip]
> ..., to have the PTX JIT reduce the number of live registers (if
> possible; I don't know), and/or could try experimenting with querying
> the active device, enum CUdevice_attribute "Device properties":
>
>     [...]
>     CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
>         Maximum number of 32-bit registers available per block
>     [...]
>
> ..., and use that in combination with each function's enum
> CUfunction_attribute "Function properties":
[snip]
> ... to determine an optimal number of threads per block given the number
> of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> would do that already?).

I have implemented that for OpenMP offloading.  But also, since CUDA 6.0
there's the cuOcc* (occupancy query) interface, which allows simply asking
the driver about the per-function launch limit.
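To make the register arithmetic concrete, here's a sketch of the
device-attribute approach (hypothetical code, not the actual plugin; the
fallback value is an assumption):

#include <cuda.h>

/* Derive a per-block thread limit from the register counts the driver
   reports, as discussed above.  */
static int
threads_from_registers (CUdevice dev, CUfunction fn)
{
  int regs_per_block, regs_per_thread;

  if (cuDeviceGetAttribute (&regs_per_block,
                            CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK,
                            dev) != CUDA_SUCCESS
      || cuFuncGetAttribute (&regs_per_thread, CU_FUNC_ATTRIBUTE_NUM_REGS,
                             fn) != CUDA_SUCCESS
      || regs_per_thread <= 0)
    return 1024;  /* Assumed fallback: the common architectural limit.  */

  /* For the kernel above on a 64K-register device: 65536 / 87 = 753
     threads, i.e. at most 23 full warps (736 threads), whereas
     workers=32, vectors=32 requests 1024.  */
  int limit = (regs_per_block / regs_per_thread) & ~31;
  return limit > 0 ? limit : 32;
}

(Note that the JIT may allocate a different number of hardware registers
than the PTX code names, so CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, which
already accounts for that, is probably the more reliable query.)

Thanks.
Alexander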
Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
On 01/19/16 06:49, Thomas Schwinge wrote:

> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

My understanding is that the JIT compiler does register allocation.

>        int axis = get_oacc_ifn_dim_arg (call);
> +     if (axis == GOMP_DIM_WORKER)
> +       {
> +         /* libgomp's nvptx plugin might potentially modify
> +            dims[GOMP_DIM_WORKER].  */
> +         return NULL_TREE;
> +       }

This is almost certainly wrong.  You're preventing constant folding in the
compiler.

nathan
Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > ... to determine an optimal number of threads per block given the number
> > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > would do that already?).
>
> I have implemented that for OpenMP offloading.  But also, since CUDA 6.0
> there's the cuOcc* (occupancy query) interface, which allows simply asking
> the driver about the per-function launch limit.

Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
is indeed sufficient for limiting threads per block, which is trivially
translatable into workers per gang in OpenACC.  IMO it's also a cleaner
approach in this case than iterative backoff (if, again, the implementation
is free to do that).

When mentioning cuOcc* I was thinking about finding an optimal number of
blocks per device, which is a different story.
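For the blocks-per-device question, the occupancy interface can be used
along these lines (a sketch with assumed fallback values, not actual
plugin code):

#include <cuda.h>

/* Ask the driver for a block size maximizing occupancy, and for the
   minimal grid size that can still saturate the device.  No dynamic
   shared memory is involved here.  */
static void
suggest_launch_dims (CUfunction fn, int *min_grid, int *block)
{
  if (cuOccupancyMaxPotentialBlockSize (min_grid, block, fn,
                                        NULL /* no dynamic smem callback */,
                                        0 /* dynamic smem bytes */,
                                        0 /* no block size limit */)
      != CUDA_SUCCESS)
    {
      /* Assumed conservative fallback.  */
      *min_grid = 1;
      *block = 128;
    }
}

Alexander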
Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
On Tue, 19 Jan 2016, Thomas Schwinge wrote:
> Hi!
>
> On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov wrote:
> > On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > > ... to determine an optimal number of threads per block given the
> > > > number of registers (maybe just querying
> > > > CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK would do that already?).
> > >
> > > I have implemented that for OpenMP offloading.  But also, since CUDA
> > > 6.0 there's the cuOcc* (occupancy query) interface, which allows
> > > simply asking the driver about the per-function launch limit.
>
> You mean you already have implemented something along the lines I
> proposed?

Yes, I was implementing OpenMP teams, and it made sense to add
warps-per-block limiting at the same time (i.e. query CU_FUNC_ATTRIBUTE_...
and limit if the default or requested number of threads per team is too
high).  I intend to post that patch as part of a larger series shortly (but
the patch itself is simple enough, although a small tweak will be needed to
make it apply to OpenACC too).
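Roughly, the limiting looks like this (a hypothetical sketch; the actual
patch, linked elsewhere in this thread, differs in details):

#include <cuda.h>

/* Clamp the requested number of warps per block to what the driver says
   this kernel can be launched with.  32 is the warp size on all current
   NVIDIA GPUs.  */
static int
limit_warps (CUfunction fn, int requested_warps)
{
  int max_threads;

  if (cuFuncGetAttribute (&max_threads,
                          CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
                          fn) != CUDA_SUCCESS)
    return requested_warps;  /* Leave the request alone on failure.  */

  int max_warps = max_threads / 32;
  return requested_warps > max_warps ? max_warps : requested_warps;
}

For the OpenACC case this would translate into clamping workers per gang,
with the vector length taking the role of the warp size.

Alexander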
Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Hi!

On Tue, 19 Jan 2016 08:47:02 -0500, Nathan Sidwell wrote:
> On 01/19/16 06:49, Thomas Schwinge wrote:
> >        int axis = get_oacc_ifn_dim_arg (call);
> > +     if (axis == GOMP_DIM_WORKER)
> > +       {
> > +         /* libgomp's nvptx plugin might potentially modify
> > +            dims[GOMP_DIM_WORKER].  */
> > +         return NULL_TREE;
> > +       }
>
> This is almost certainly wrong.  You're preventing constant folding in
> the compiler.

Yes: if libgomp can modify dims[GOMP_DIM_WORKER], then the compiler cannot
assume it to be constant?  (Folding it did result in a run-time test
verification failure.)  Of course, my hammer might be too big a one (which
is why this is an RFC).

Grüße
 Thomas
Re: [RFC] [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Hi!

On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov wrote:
> On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > ... to determine an optimal number of threads per block given the
> > > number of registers (maybe just querying
> > > CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK would do that already?).
> >
> > I have implemented that for OpenMP offloading.  But also, since CUDA
> > 6.0 there's the cuOcc* (occupancy query) interface, which allows simply
> > asking the driver about the per-function launch limit.

You mean you already have implemented something along the lines I
proposed?

> Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> is indeed sufficient for limiting threads per block, which is trivially
> translatable into workers per gang in OpenACC.

That's good to know, thanks!

> IMO it's also a cleaner approach in this case than iterative backoff (if,
> again, the implementation is free to do that).

It is not explicitly spelled out in OpenACC 2.0a, but it got clarified in
OpenACC 2.5.  See "2.5.7. num_workers clause": "[...] The implementation
may use a different value than specified based on limitations imposed by
the target architecture".

> When mentioning cuOcc* I was thinking about finding an optimal number of
> blocks per device, which is a different story.

:-)

Grüße
 Thomas