The code is actually found in the CUDA SDK, which I assume you downloaded
with the CUDA toolkit and CUDA driver from NVIDIA at
http://www.nvidia.com/object/cuda_get.html.
Go to your SDK_INSTALLATION_DIR/C/src/scan/
The doc subdirectory in there also contains the PDF version of the GPU Gems
chapter you posted. Also note there is a scanLargeArray directory under src
for scanning large, non-power-of-2 arrays. It's very good learning material
as well as being genuinely useful.
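
In case it helps, here is a rough, untested sketch of the core idea from that
chapter wrapped up in pycuda: a single-block, work-efficient (Blelloch)
exclusive scan. It assumes the input length is a power of 2 and small enough
for one block, and skips the bank-conflict avoidance; the SDK scan and
scanLargeArray examples handle the general case properly, so treat this only
as a starting point.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scan_block(float *out, const float *in, int n)
{
    // shared-memory work-efficient scan: one block, n a power of 2
    extern __shared__ float temp[];
    int tid = threadIdx.x;

    temp[2*tid]     = in[2*tid];
    temp[2*tid + 1] = in[2*tid + 1];

    // up-sweep (reduce) phase
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (tid < d)
        {
            int ai = offset*(2*tid + 1) - 1;
            int bi = offset*(2*tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // down-sweep phase builds the exclusive scan in place
    if (tid == 0) temp[n - 1] = 0;
    for (int d = 1; d < n; d <<= 1)
    {
        offset >>= 1;
        __syncthreads();
        if (tid < d)
        {
            int ai = offset*(2*tid + 1) - 1;
            int bi = offset*(2*tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    out[2*tid]     = temp[2*tid];
    out[2*tid + 1] = temp[2*tid + 1];
}
""")

scan_block = mod.get_function("scan_block")

n = 8
a = numpy.random.rand(n).astype(numpy.float32)
result = numpy.empty_like(a)
# n/2 threads, each handling two elements; n floats of dynamic shared memory
scan_block(drv.Out(result), drv.In(a), numpy.int32(n),
           block=(n // 2, 1, 1), grid=(1, 1), shared=4 * n)
# result[i] should equal sum(a[:i]) (exclusive scan)
print(result)
print(numpy.cumsum(a) - a)
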
bill
Peter Berrington wrote:
>
> This may be a stupid question, but has anyone got any experience in
> implementing a parallel prefix sum with pycuda? I am trying to replace a
> portion of numpy code that I was using to calculate the root mean square
> difference between two images. There are a few steps in this procedure: a
> new array is created from the pairwise difference between the two source
> arrays, each element of this new array is squared, and finally the elements
> are summed and divided by the array length in order to calculate the mean
> squared error between the two source images. My numpy code looks like this
> (hopefully the indentation is intact):
>
> import math
> import numpy
>
> def rmse(first, second):
>     """ Returns the root mean square error between two image arrays. """
>     assert numpy.size(first) == numpy.size(second)
>     difference = numpy.diff(numpy.dstack((first.astype(numpy.float64),
>                                           second.astype(numpy.float64))))
>     return 1 - math.sqrt(numpy.mean(difference**2)) / 255.0
>
>
> It's not hard to see how pycuda could be used to accomplish the first few
> steps of calculating the squared pairwise difference between two source
> arrays: I was thinking that I would upcast both image arrays to floats,
> since I would be unable to accurately square or subtract them from each
> other as 8-bit unsigned integers. Off the top of my head, the kernel might
> start looking like this:
>
> mod = SourceModule("""
> __global__ void diff_and_square(float *dest, float *a, float *b)
> {
>     const int i = threadIdx.x;
>     dest[i] = a[i]*a[i] + b[i]*b[i] - (2 * a[i] * b[i]);
> }
> """)
>
> but I'm stuck on the last step: summation.
>
> I've read through this page on implementing parallel prefix sum with CUDA:
> http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
> It's a pretty good resource for understanding how to implement a prefix
> scan while avoiding bank conflicts and remaining work-efficient.
> Unfortunately, the source code offered for download on that page is no
> longer hosted; it seems it's been replaced by similar functionality in the
> CUDA Data Parallel Primitives (CUDPP) library. I was trying to skim through
> that code to see if it might be possible to plug it in, and it seems quite
> complicated, using a plan manager class to control the behaviour of the
> prefix scan in a very generic way.
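> 
> Just to be explicit about what I'm after: for the mean squared error I only
> need the final total, which (if I'm reading the chapter correctly) is just
> the last element of the inclusive scan. A quick numpy illustration of what
> I mean:
> 
> import numpy
> a = numpy.array([3, 1, 7, 0, 4, 1, 6, 3])
> print(numpy.cumsum(a))       # inclusive scan: [ 3  4 11 11 15 16 22 25]
> print(numpy.cumsum(a)[-1])   # same as numpy.sum(a), i.e. 25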
>
> I was hoping someone might be able to offer some advice on the most
> painless way to accomplish this with pycuda. Thanks in advance!
>
_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net