On Thu, 21 May 2015 07:59:35 -0400
Andreas Kloeckner <[email protected]> wrote:

> Luke Pfister <[email protected]> writes:
> > Is there a suggested way to do the equivalent of np.sum along a particular
> > axis for a high-dimensional GPUarray?
> >
> > I saw that this was discussed in 2009, before GPUarrays carried stride
> > information.
> 
> Hand-writing a kernel is probably still your best option. Just map the
> non-reduction axes to the grid/thread block axes, and write a for loop
> to do the summation.
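For what it's worth, the mapping Andreas describes can be sketched in
plain Python (a sequential emulation, not actual PyCUDA code; the array
shape, axis choice, and function name are just illustrative):

```python
# Emulate summing a row-major 3-D array along its last axis, the way a
# hand-written kernel would: each (i, j) pair plays the role of one GPU
# thread mapped onto the grid/block axes, and that thread runs a for
# loop over the reduction axis.

def sum_along_last_axis(flat, shape):
    """flat: row-major data of a 3-D array; returns a nested 2-D list."""
    n0, n1, n2 = shape
    out = [[0.0] * n1 for _ in range(n0)]
    for i in range(n0):          # "grid" axis
        for j in range(n1):      # "thread block" axis
            acc = 0.0
            for k in range(n2):  # per-thread loop over the reduced axis
                acc += flat[(i * n1 + j) * n2 + k]
            out[i][j] = acc
    return out
```

In a real kernel, the two outer loops disappear: i and j come from the
grid/block indices, and only the inner k loop remains per thread.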

Won't you win by having one workgroup (sorry, that's the OpenCL name; I 
can't remember the CUDA one) do a partial parallel reduction?

i.e. 1 workgroup = 32 threads

First stage:
32x (read + add) into shared memory, repeated as many times as needed to 
cover the reduction dimension of the gpuarray

Second stage:
Parallel reduction within shared memory (even without a barrier, since we 
are within a single warp)
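The two stages could be emulated sequentially in plain Python like so (a
sketch of what the 32 "threads" would each do; the warp size and names
are illustrative, and on a real GPU stage 2 would run in lockstep within
the warp rather than in a loop):

```python
WARP = 32  # threads per workgroup/warp (illustrative)

def two_stage_sum(values):
    """Stage 1: each of the 32 'threads' strides over the input and
    accumulates a partial sum into its slot of 'shared memory'.
    Stage 2: pairwise tree reduction within shared memory (on a real
    GPU, barrier-free because all lanes are in one warp)."""
    shared = [0.0] * WARP
    # First stage: 32x (read + add), as many passes as the length needs
    for tid in range(WARP):
        shared[tid] = sum(values[tid::WARP])
    # Second stage: parallel reduction within shared memory
    step = WARP // 2
    while step > 0:
        for tid in range(step):
            shared[tid] += shared[tid + step]
        step //= 2
    return shared[0]
```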

Cheers, 
-- 
Jérôme Kieffer
tel +33 476 882 445


_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
