This may be a stupid question, but has anyone got any experience in
implementing a parallel prefix sum with pycuda? I am trying to replace a
portion of numpy code I had which was used to calculate the root mean square
difference between two images. Theres a few steps in this procedure: a new
array is created from the pairwise difference between the two source arrays.
Each element of this new array is squared, and finally the elements of the
new array are summed and divided by the array length in order to calculate
the average square error between the two source images. My numpy code looks
like this (hopefully the indentation is intact):

def rmse(first, second):
    """ Returns the root mean square error between two image arrays. """
    assert numpy.size(first) == numpy.size(second)
    difference = numpy.diff(numpy.dstack(
        (first.astype(numpy.float64),
        second.astype(numpy.float64))))
    return 1 - math.sqrt(numpy.mean(difference**2)) / 255.0


It's not hard to see how pycuda could be used to accomplish the first few
steps of calculating the squared pairwise difference between two source
arrays: I was thinking that I would upcast both image arrays to floats,
since I would be unable to accurately square or subtract them from each
other as 8-bit unsigned integers. Off the top of my head, the kernel might
start looking like this

mod = SourceModule("""
__global__ void diff_and_square(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i]*a[i] + b[i]*b[i] - (2 * a[i] * b[i]);
}
""")

but I'm stuck on the last step: summation.

I've read through this page on implementing parallel prefix sum with CUDA:
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
It's a pretty good resource for understanding how to implement a prefix scan
while avoiding bank conflicts and remaining work efficient. Unfortunately,
the source code offered for download on that page is no longer hosted, it
seems that its been replaced with similar functionality in the CUDA Data
Parallel Primitives library- I was trying to skim through this code to see
if it might be possible to plug it in and it seems quite complicated, using
a plan manager class to control the behaviour of the prefix scan in a very
genericized way.

 I was hoping someone might be able to offer some advice on the most
painless way to accomplish this with pycuda. Thanks in advance!
_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net

Reply via email to