Thomas Wiecki <thomas_wie...@brown.edu> writes:

> On Wed, May 30, 2012 at 7:28 PM, Andreas Kloeckner <kloeck...@cims.nyu.edu>
> wrote:
>> On Wed, 30 May 2012 08:10:03 -0400, Thomas Wiecki <thomas_wie...@brown.edu>
>> wrote:
>>> OK, so blockDim.x*gridDim.x gives the max number of threads?
>>
>> 'max number of threads' is also bad terminology. Those would be device
>> properties. 'Total number of threads in current launch' is correct,
>> assuming it's a 1D launch.
>>
>>> I assumed
>>> for small arrays it would just be 1 in which case the for loop would
>>> be looping over the whole array.
>>>
>>> Can you elaborate on why it is said that this approach is slower than
>>> when you can guarantee that size < max_threads? In that case the for
>>> loop should only run for one iteration.
>>
>> Sorry, I'm sure I'm being dense here--but I really don't understand
>> what you think the difference between 'delta' and 'max_threads' is.
>
> Yes, you are absolutely correct. I was confused about what
> blockDim.x*gridDim.x gives but it makes sense now and they are in fact
> equivalent.
>
>> If
>> you're asking about the maximal number of threads the device can
>> support (see above), there are good reasons to do smaller launches, as
>> long as they still fill the machine. (and PyCUDA makes sure of that)
>
> What are those good reasons?

There's some (small) overhead for switching thread blocks compared to
just executing code within a block. So more blocks launched -> more of
that overhead. The point is that CUDA pretends that there's an
'infinite' number of cores, and it's up to you to choose how many of
those to use. Because of the (very slight) penalty, it's best not to
stretch the illusion of 'infinitely many cores' too far if it's not
necessary. (In fact, much of the overhead is in address computations and
such, which can be amortized if there's just a single long for loop.)
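To make that concrete, here's a minimal sketch of the 'single long for
loop' pattern (a grid-stride loop). The kernel name, array size, and
launch shape are made up for illustration:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void twice(float *out, int n)
{
    /* total number of threads in this (1D) launch */
    int total = blockDim.x * gridDim.x;

    /* Each thread strides over the array. The index arithmetic is
       done once per iteration and amortized over the whole loop. */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += total)
        out[i] *= 2.0f;
}
""")
twice = mod.get_function("twice")

n = 1 << 20
out = gpuarray.to_gpu(np.ones(n, dtype=np.float32))

# A modest launch: the for loop picks up whatever the grid doesn't cover.
twice(out.gpudata, np.int32(n), block=(256, 1, 1), grid=(128, 1))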

> Assuming these good reasons exist, what functionality does PyCUDA offer
> for doing smaller launches that still fill the machine? I assume you are
> referring to the block and grid parameters. So instead of the above, I'd
> write a kernel without the for loop and launch it like this (assuming my
> device can launch 512 threads per block):

Assuming your workload is 'embarrassingly parallel', you can choose
how to use that parallelism: in a for loop, in block size, or in grid
size. What I'm talking about is just how to make a seat-of-the-pants
tradeoff between those.
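As a rough illustration (the block size and blocks-per-multiprocessor
factor below are arbitrary guesses you would want to tune, not
recommendations), you could fix the block size, pick just enough blocks
to keep every multiprocessor busy, and let the per-thread for loop
absorb the rest:

import pycuda.autoinit
import pycuda.driver as drv

dev = pycuda.autoinit.device
sm_count = dev.get_attribute(drv.device_attribute.MULTIPROCESSOR_COUNT)

block_size = 256       # arbitrary; 128-512 are all common choices
blocks = 8 * sm_count  # a few blocks per multiprocessor

# Reusing the 'twice' kernel from the sketch above; each thread's for
# loop covers whatever work the grid doesn't.
twice(out.gpudata, np.int32(n), block=(block_size, 1, 1), grid=(blocks, 1))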

> import numpy as np
> import pycuda.autoinit
> from pycuda import gpuarray
>
> size_out = 2048
> out = gpuarray.zeros(size_out, np.float32)
> my_kernel(out, block=(min(512, size_out), 1, 1),
>           grid=(size_out // 512, 1))
>
> However, in my actual case I think I can't use this pattern as I am passing
>
> pycuda.curandom.XORWOWRandomNumberGenerator().state
> to the kernel. I think this stores the generators inside of shared memory.
> So using grid size > 1 would try to access generators that were not
> initialized. However, could I initialize generators on multiple grid cells
> (i.e. device memory) and use the grid approach without a for loop? Would it
> be more efficient?
>
> I obviously haven't grasped all the concepts completely, so any
> clarification would be much appreciated.

Check the code in pycuda.curandom to see how the generator state is used
there. I'm certain it uses a grid size > 1; otherwise most of the machine
would go unused.
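As far as I can tell from pycuda.curandom, the generator states live in
global (device) memory, one per thread, not in shared memory. Below is a
sketch of the usual cuRAND device-API pattern, not pycuda.curandom's
exact internals; it assumes one state has already been curand_init'ed
for every thread in the launch, and the kernel name is made up:

from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <curand_kernel.h>

extern "C" __global__ void fill_uniform(float *out, int n,
                                        curandState *states)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    /* Pull this thread's state from global memory into registers,
       use it in a grid-stride loop, then write it back so the next
       launch continues the sequence. */
    curandState local = states[tid];
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        out[i] = curand_uniform(&local);
    states[tid] = local;
}
""", no_extern_c=True)

The one constraint is that a launch must not use more threads than there
are initialized states.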

Andreas 
