On Wed, May 30, 2012 at 7:28 PM, Andreas Kloeckner <[email protected]> wrote:
> On Wed, 30 May 2012 08:10:03 -0400, Thomas Wiecki <[email protected]> wrote:
>> OK, so blockDim.x*gridDim.x gives the max number of threads?
>
> 'max number of threads' is also bad terminology. Those would be device
> properties. 'Total number of threads in current launch' is correct,
> assuming it's a 1D launch.
>
>> I assumed for small arrays it would just be 1, in which case the for
>> loop would be looping over the whole array.
>>
>> Can you elaborate on why it is said that this approach is slower than
>> when you can guarantee that size < max_threads? In that case the for
>> loop should only go 1 iteration.
>
> Sorry, I'm sure I'm being dense here--but I really don't understand what
> the difference between 'delta' and 'max_threads' in your opinion is.
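[To make the terminology above concrete: in a grid-stride loop each thread starts at its global index and steps by the total number of threads in the launch (blockDim.x * gridDim.x for a 1D launch). A host-side Python sketch of that indexing; the function name is illustrative, not PyCUDA API:]

```python
def grid_stride_indices(global_thread_id, total_threads, n):
    """Indices one thread touches in a grid-stride loop over n elements."""
    return list(range(global_thread_id, n, total_threads))

# A launch of 8 total threads covering an array of 20 elements:
total, n = 8, 20
covered = sorted(i for t in range(total)
                 for i in grid_stride_indices(t, total, n))
# Every element is visited exactly once, regardless of launch size;
# with a small array and a big launch, each thread does 0 or 1 iterations.
assert covered == list(range(n))
```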
Yes, you are absolutely correct. I was confused about what blockDim.x*gridDim.x gives, but it makes sense now; they are in fact equivalent.

> If you're asking about the maximal number of threads the device can
> support (see above), there are good reasons to do smaller launches, as
> long as they still fill the machine. (and PyCUDA makes sure of that)

What are those good reasons? And assuming they exist, what is the functionality in PyCUDA for doing smaller launches that still fill the machine? I assume you are referring to the block and grid parameters. So instead of the above I would write a kernel without the for loop and launch it like this (assuming my device supports 512 threads per block):

size_out = 2048
out = gpuarray.zeros(size_out, np.float32)
my_kernel(out, block=(min(512, size_out), 1, 1), grid=(size_out // 512, 1))

However, in my actual case I don't think I can use this pattern, as I am passing pycuda.curandom.XORWOWRandomNumberGenerator().state to the kernel. I think this stores the generators in shared memory, so using a grid size > 1 would try to access generators that were not initialized. Could I instead initialize generators for multiple grid cells (i.e. in device memory) and use the grid approach without a for loop? Would that be more efficient?

I obviously haven't grasped all the concepts completely, so any clarification would be much appreciated.

Thanks,
Thomas
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
