Hey Peter,

Have you looked at pycuda/reduction.py
<http://git.tiker.net/pycuda.git/blob/HEAD:/pycuda/reduction.py>?
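The ReductionKernel in there lets you fold the "diff, square, sum" steps into a
single call, so you don't need an explicit prefix scan at all for this. A rough,
untested sketch (rmse_gpu is just a name I made up; double-check the constructor
arguments against reduction.py):

import math
import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.reduction import ReductionKernel

# map_expr squares each pairwise difference, reduce_expr sums the results,
# so the whole pipeline runs as a single reduction kernel.
sq_diff_sum = ReductionKernel(numpy.float32, neutral="0",
        reduce_expr="a+b",
        map_expr="(x[i]-y[i])*(x[i]-y[i])",
        arguments="float *x, float *y")

def rmse_gpu(first, second):
    # Should give the same number as your numpy version.
    assert first.size == second.size
    a = gpuarray.to_gpu(first.astype(numpy.float32).ravel())
    b = gpuarray.to_gpu(second.astype(numpy.float32).ravel())
    mse = sq_diff_sum(a, b).get() / a.size
    return 1 - math.sqrt(mse) / 255.0

float32 should be plenty of precision for squared differences of 8-bit pixel
values, but you could switch the expressions to double if your card supports it.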
N

On Fri, Aug 21, 2009 at 9:31 PM, Peter Berrington <[email protected]> wrote:

> This may be a stupid question, but has anyone got any experience in
> implementing a parallel prefix sum with pycuda? I am trying to replace a
> portion of numpy code I had which was used to calculate the root mean
> square difference between two images. There are a few steps in this
> procedure: a new array is created from the pairwise difference between the
> two source arrays, each element of this new array is squared, and finally
> the elements of the new array are summed and divided by the array length
> in order to calculate the average square error between the two source
> images. My numpy code looks like this (hopefully the indentation is
> intact):
>
> def rmse(first, second):
>     """ Returns the root mean square error between two image arrays. """
>     assert numpy.size(first) == numpy.size(second)
>     difference = numpy.diff(numpy.dstack(
>         (first.astype(numpy.float64),
>          second.astype(numpy.float64))))
>     return 1 - math.sqrt(numpy.mean(difference**2)) / 255.0
>
> It's not hard to see how pycuda could be used to accomplish the first few
> steps of calculating the squared pairwise difference between two source
> arrays. I was thinking that I would upcast both image arrays to floats,
> since I would be unable to accurately square or subtract them from each
> other as 8-bit unsigned integers. Off the top of my head, the kernel might
> start looking like this:
>
> mod = SourceModule("""
> __global__ void diff_and_square(float *dest, float *a, float *b)
> {
>     const int i = threadIdx.x;
>     dest[i] = a[i]*a[i] + b[i]*b[i] - (2 * a[i] * b[i]);
> }
> """)
>
> but I'm stuck on the last step: summation.
>
> I've read through this page on implementing parallel prefix sum with CUDA:
> http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
> It's a pretty good resource for understanding how to implement a prefix
> scan while avoiding bank conflicts and remaining work efficient.
> Unfortunately, the source code offered for download on that page is no
> longer hosted; it seems that it has been replaced with similar
> functionality in the CUDA Data Parallel Primitives library. I was trying
> to skim through this code to see if it might be possible to plug it in,
> and it seems quite complicated, using a plan manager class to control the
> behaviour of the prefix scan in a very genericized way.
>
> I was hoping someone might be able to offer some advice on the most
> painless way to accomplish this with pycuda. Thanks in advance!

--
Nicolas Pinto
Ph.D. Candidate, Brain & Computer Sciences
Massachusetts Institute of Technology, USA
http://web.mit.edu/pinto
_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
