On Thu, 21 May 2015 07:59:35 -0400 Andreas Kloeckner <[email protected]> wrote:
> Luke Pfister <[email protected]> writes:
> > Is there a suggested way to do the equivalent of np.sum along a
> > particular axis for a high-dimensional GPUArray?
> >
> > I saw that this was discussed in 2009, before GPUArrays carried
> > stride information.
>
> Hand-writing a kernel is probably still your best option. Just map the
> non-reduction axes to the grid/thread block axes, and write a for loop
> to do the summation.

Wouldn't you win by having one workgroup (sorry, that is the OpenCL name; I can't remember the CUDA one) do a partial parallel reduction? I.e. one workgroup = 32 threads.

First stage: 32 × (read + add) into shared memory, repeated as many times as the size of the reduction axis requires.

Second stage: parallel reduction within shared memory (even without a barrier, since we are within a single warp).

Cheers,
--
Jérôme Kieffer
tel +33 476 882 445
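For reference, the two-stage scheme above can be sketched as a host-side NumPy model (this is an illustration of the algorithm, not PyCUDA API; `warp_sum` and `sum_axis1` are hypothetical names, and a 32-thread warp is assumed):

```python
import numpy as np

WARP = 32  # threads per workgroup/warp (assumption)

def warp_sum(row):
    """Model of one 32-thread warp cooperatively summing `row`."""
    # Stage 1: each "thread" t strides over the row, accumulating
    # row[t], row[t+32], row[t+64], ... into its shared-memory slot.
    shared = np.zeros(WARP, dtype=row.dtype)
    for t in range(WARP):
        shared[t] = row[t::WARP].sum()
    # Stage 2: tree reduction inside shared memory. On the GPU the
    # threads of a single warp run in lockstep, so no barrier is
    # needed; here it is simply a sequential loop.
    stride = WARP // 2
    while stride > 0:
        shared[:stride] += shared[stride:2 * stride]
        stride //= 2
    return shared[0]

def sum_axis1(a):
    """Equivalent of np.sum(a, axis=1): one warp per output element,
    standing in for one non-reduction index of the GPUArray."""
    return np.array([warp_sum(row) for row in a])
```

In a real kernel, each non-reduction index would map to one block/workgroup (as Andreas suggests), with stage 1 as the per-thread strided loop and stage 2 as the in-warp reduction.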
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
