On Thu, Jun 7, 2012 at 11:50 AM, Andreas Kloeckner
<[email protected]> wrote:

> >> If
> >> you're asking about the maximal number of threads the device can
> >> support (see above), there are good reasons to do smaller launches, as
> >> long as they still fill the machine. (and PyCUDA makes sure of that)
> >
> > What are those good reasons?
>
> There's some (small) overhead for switching thread blocks compared to
> just executing code within a block. So more blocks launched -> more of
> that overhead. The point is that CUDA pretends that there's an
> 'infinite' number of cores, and it's up to you to choose how many of
> those to use. Because of the (very slight) penalty, it's best not to
> stretch the illusion of 'infinitely many cores' too far if it's not
> necessary. (In fact, much of the overhead is in address computations and
> such, which can be amortized if there's just a single long for loop.)
>

I see. In my case each item takes quite a while to compute, so the
(small) overhead of switching thread blocks is probably well worth
taking.
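To make sure I understand the trade-off, here is a CPU-side sketch (plain
Python, no GPU needed) of the kind of grid-stride for loop being described:
each thread starts at its global index and steps by the total number of
launched threads, so a small launch still covers a large array. The names
are mine, purely for illustration:

```python
# CPU-side sketch of a grid-stride loop.  On the GPU, global_idx would
# be blockIdx.x * blockDim.x + threadIdx.x and total_threads would be
# gridDim.x * blockDim.x; here they are plain integers.

def grid_stride_indices(global_idx, total_threads, n):
    """Indices one thread processes for an array of length n."""
    return list(range(global_idx, n, total_threads))

def covered(total_threads, n):
    """Every (thread, element) assignment for a launch of total_threads."""
    out = []
    for t in range(total_threads):
        out.extend(grid_stride_indices(t, total_threads, n))
    return sorted(out)

# A launch of only 128 threads still touches every element of a
# 1000-element array exactly once; the per-thread for loop amortizes
# the per-block launch/addressing overhead:
assert covered(128, 1000) == list(range(1000))
```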

> > Assuming these good reasons exist, what's the functionality in PyCUDA to do
> > smaller launches to fill the machine? I assume you refer to the block and
> > grid parameters. So instead of the above I write a kernel without the for
> > loop and launch like this (assuming my device can launch 512 threads per
> > block):
>
> Assuming your workload is 'embarrassingly parallel', you can choose
> how to use that parallelism: in a for loop, in block size, or in grid
> size. What I'm talking about is just how to make a seat-of-the-pants
> tradeoff between those.
>
> > size_out = 2048
> > out = gpuarray.zeros(size_out, np.float32)
> > my_kernel(out, block=(min(512, size_out), 1, 1),
> >           grid=(size_out // 512, 1))
> >
> > However, in my actual case I think I can't use this pattern as I am passing
> >
> > pycuda.curandom.XORWOWRandomNumberGenerator().state
> > to the kernel. I think this stores the generators inside of shared memory.
> > So using grid size > 1 would try to access generators that were not
> > initialized. However, could I initialize generators on multiple grid cells
> > (i.e. device memory) and use the grid approach without a for loop? Would it
> > be more efficient?
> >
> > I obviously haven't grasped all the concepts completely so any
> > clarification would be much appreciated.
>
> Check the code in pycuda.curandom for how it's used there. I'm certain
> this uses grid_size > 1, otherwise most of the machine would go unused.
>

I think this is the relevant call in XORWOWRandomNumberGenerator:

    p.prepared_call((self.block_count, 1),
                    (self.generators_per_block, 1, 1),
                    self.state,
                    self.block_count * self.generators_per_block,
                    seed.gpudata, offset)

So if I read that correctly, it initializes block_count *
generators_per_block generators, i.e. the maximum number available.
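So the largest launch that is safe when passing generator.state is bounded by
that product. A trivial sketch of the arithmetic (block_count and
generators_per_block here are made-up stand-ins for the attributes of
pycuda.curandom.XORWOWRandomNumberGenerator, not values queried from a
device):

```python
# Example numbers, not queried from any device:
block_count = 14            # e.g. one block per multiprocessor
generators_per_block = 256  # generators initialized within each block

# A thread whose global index is >= this bound would read
# uninitialized generator state:
max_safe_threads = block_count * generators_per_block
assert max_safe_threads == 14 * 256
```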

It seems that calling a kernel on an array that is larger than
threads_per_block * blocks is in general safe, provided the kernel
contains a for loop that strides over the array: each thread starts at
its idx and steps by the total number of launched threads, so every
element is reached and the extra work is serialized over the threads
that actually exist.

However, if I supply generator.state and launch more threads than there
are initialized generators, this serializing will not work: the idx will
index into generator state outside of what was initialized. I think this
is what caused my problems before.

The solution, it seems, is to use the for loop approach and then always
call the kernel like this:

    my_kernel(generator.state, out,
              block=(generator.generators_per_block, 1, 1),
              grid=(generator.block_count, 1))


That way I am sure I will never try to access uninitialized generators and
only use the for loop if I have to.
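To check my own reasoning, here is a CPU-side sketch of that pattern (plain
Python, illustrative names): exactly one thread per initialized generator,
each striding over the output with a for loop, so no thread ever touches a
generator it does not own:

```python
# CPU-side sketch of the proposed pattern.  On the GPU, idx would be
# blockIdx.x * blockDim.x + threadIdx.x; here the launch is simulated
# with plain loops.

def run_kernel(block_count, generators_per_block, size_out):
    n_threads = block_count * generators_per_block
    draws_per_generator = [0] * n_threads   # one RNG state per thread
    written = [False] * size_out
    for idx in range(n_threads):            # all launched threads
        # Grid-stride for loop: thread idx only ever uses generator idx,
        # however large the output array is.
        for i in range(idx, size_out, n_threads):
            draws_per_generator[idx] += 1
            written[i] = True
    return draws_per_generator, written

draws, written = run_kernel(block_count=4, generators_per_block=8,
                            size_out=100)
assert all(written)          # the whole output array gets covered
assert len(draws) == 4 * 8   # no generator beyond the 32 initialized ones
```

If the output is no larger than the number of generators, each thread's for
loop simply runs a single iteration, which matches "only use the for loop
if I have to".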

Does that make sense?

Thomas
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda