Hello,

As far as I understand, this is somehow connected with pycuda::complex having a default constructor. When you write (in the reduction kernel):

    __shared__ out_type sdata[BLOCK_SIZE];

and "out_type" has a default constructor, it is called from _every_ thread for _every_ element of the array. The reduction kernel starts like:

    __shared__ out_type sdata[BLOCK_SIZE];
    ... collecting values ...
    sdata[tid] = <some value>

which means there is a race between the default constructor and the actual initialization. I am not sure why nvcc does not complain about this (bug?), but either of the following helps:

1. Use "extern __shared__ out_type sdata[]" and set the size of shared memory when preparing the kernel; or
2. Put "__syncthreads()" between the default initialization and the actual initialization (not very good, since it leaves all those calls to the default constructor in place, but it still removes the symptom).

I googled a bit, but could not find any actual rules about default constructors and shared memory.

Best regards,
Bogdan

On Fri, Jan 27, 2012 at 4:19 AM, Andreas Kloeckner <[email protected]> wrote:
> Hi Jesse,
>
> On Wed, 25 Jan 2012 14:51:05 -0800, Jesse Lu <[email protected]> wrote:
>> The attached script shows highly inconsistent results (> 10% error at
>> times) between the numpy and gpuarray dot products. The inconsistent
>> results seem to only appear for large gpuarrays of data type complex64 or
>> complex128. Any ideas on what's going on? Thanks!
>
> I can reproduce the issue, and I'll try to see what's behind
> it. Unfortunately, I have a lot of stuff to do at the moment, so I can't
> give you an ETA. Of course, I'd also appreciate any help in getting this
> debugged--from anyone on the list! :)
>
> Andreas
>
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
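P.S. The pattern described above can be sketched roughly like this. This is a minimal illustration, not the actual pycuda reduction code: "out_type", the kernel names, and the toy constructor are all stand-ins, and the exact constructor behavior for __shared__ arrays may depend on the nvcc version.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Toy stand-in for pycuda::complex: a class type with a non-trivial
// default constructor (the real type lives in pycuda-complex.hpp).
struct out_type {
    float re, im;
    __device__ out_type() : re(0.f), im(0.f) {}
    __device__ out_type(float r, float i) : re(r), im(i) {}
};

// Racy version: the statically sized __shared__ array causes the default
// constructor to be run by every thread for every element, with no
// synchronization against the assignments below.
__global__ void reduce_racy(const float *in, out_type *out, unsigned n)
{
    __shared__ out_type sdata[BLOCK_SIZE];

    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x + tid;

    // Workaround 2 would be a __syncthreads() right here, between the
    // implicit default construction above and this assignment.
    sdata[tid] = (i < n) ? out_type(in[i], 0.f) : out_type();

    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        __syncthreads();
        if (tid < s)
            sdata[tid] = out_type(sdata[tid].re + sdata[tid + s].re,
                                  sdata[tid].im + sdata[tid + s].im);
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Workaround 1: dynamically sized shared memory. No per-element default
// constructors are emitted; the size is supplied in the launch
// configuration, e.g.
//   reduce_fixed<<<grid, BLOCK_SIZE, BLOCK_SIZE * sizeof(out_type)>>>(...);
__global__ void reduce_fixed(const float *in, out_type *out, unsigned n)
{
    extern __shared__ out_type sdata[];

    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? out_type(in[i], 0.f) : out_type();
    // ... same reduction loop as above ...
}
```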
