Thomas Wiecki <thomas_wie...@brown.edu> writes:

> On Wed, May 30, 2012 at 7:28 PM, Andreas Kloeckner
> <kloeck...@cims.nyu.edu> wrote:
>> On Wed, 30 May 2012 08:10:03 -0400, Thomas Wiecki
>> <thomas_wie...@brown.edu> wrote:
>>> OK, so blockDim.x*gridDim.x gives the max number of threads?
>>
>> 'Max number of threads' is also bad terminology. Those would be device
>> properties. 'Total number of threads in the current launch' is correct,
>> assuming it's a 1D launch.
>>
>>> I assumed for small arrays it would just be 1, in which case the for
>>> loop would loop over the whole array.
>>>
>>> Can you elaborate on why this approach is said to be slower than when
>>> you can guarantee that size < max_threads? In that case the for loop
>>> should only run for one iteration.
>>
>> Sorry, I'm sure I'm being dense here--but I really don't understand
>> what you see as the difference between 'delta' and 'max_threads'.
>
> Yes, you are absolutely correct. I was confused about what
> blockDim.x*gridDim.x gives, but it makes sense now, and they are in
> fact equivalent.
>
>> If you're asking about the maximal number of threads the device can
>> support (see above), there are good reasons to do smaller launches,
>> as long as they still fill the machine. (And PyCUDA makes sure of
>> that.)
>
> What are those good reasons?
There's some (small) overhead to switching thread blocks compared to just
executing code within a block, so more blocks launched means more of that
overhead. The point is that CUDA pretends there's an 'infinite' number of
cores, and it's up to you to choose how many of those to use. Because of
the (very slight) penalty, it's best not to stretch the illusion of
'infinitely many cores' further than necessary. (In fact, much of the
overhead is in address computations and such, which can be amortized if
there's just a single long for loop.)

> Assuming these good reasons exist, what's the functionality in PyCUDA
> to do smaller launches that still fill the machine? I assume you mean
> the block and grid parameters. So instead of the above, I'd write a
> kernel without the for loop and launch it like this (assuming my
> device can launch 512 threads per block):

Assuming your workload is 'embarrassingly parallel', you can choose how
to spend that parallelism: in a for loop, in block size, or in grid size.
What I'm talking about is just how to make a seat-of-the-pants tradeoff
between those.

> size_out = 2048
> out = gpuarray.zeros(size_out, np.float32)
> my_kernel(out, block=(min(512, size_out), 1, 1),
>           grid=(size_out // 512, 1))
>
> However, in my actual case I think I can't use this pattern, as I am
> passing
>
> pycuda.curandom.XORWOWRandomNumberGenerator().state
>
> to the kernel. I think this stores the generators in shared memory, so
> using a grid size > 1 would try to access generators that were not
> initialized. However, could I initialize generators on multiple grid
> cells (i.e. in device memory) and use the grid approach without a for
> loop? Would that be more efficient?
>
> I obviously haven't grasped all the concepts completely, so any
> clarification would be much appreciated.

Check the code in pycuda.curandom for how it's used there. I'm certain it
uses grid_size > 1--otherwise most of the machine would go unused.

Andreas
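[Editor's note: for readers following along, here is a minimal sketch of
the two ingredients discussed above--a grid-stride kernel (CUDA C source
as a string, in the style one would pass to PyCUDA's SourceModule) and a
helper that picks a 1D block/grid pair. The kernel name `scale` and the
helper `launch_config` are illustrative, not part of PyCUDA; the code is
deliberately GPU-independent so the launch arithmetic can be checked on
its own.]

```python
import math

# CUDA C source for a grid-stride kernel. blockDim.x * gridDim.x is the
# total number of threads in the current launch (NOT a device limit), so
# the for loop lets any launch size cover any array size.
KERNEL_SOURCE = """
__global__ void scale(float *out, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < n; i += stride)
        out[i] *= factor;
}
"""

def launch_config(n, max_threads_per_block=512):
    """Pick a 1D (block, grid) pair that covers n elements one-to-one,
    i.e. the pattern where the kernel's for loop runs one iteration."""
    block = min(max_threads_per_block, n)   # min, not max: blocks are capped
    grid = math.ceil(n / block)             # round up so block*grid >= n
    return (block, 1, 1), (grid, 1)

# Example from the thread: 2048 elements with 512-thread blocks -> 4 blocks.
print(launch_config(2048))  # ((512, 1, 1), (4, 1))
```

Note that `size_out // 512` in the quoted launch only works when `size_out`
is an exact multiple of 512; rounding up (and bounds-checking with `i < n`
in the kernel) handles the general case.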
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda