I am experimenting with convolution kernels for 3D arrays.  I have a
scalar version that seems to work, but when I tried to reduce extra
data copies, I ran into problems.  I isolated the change in my Python
code to two versions that get the source NumPy array into a PyOpenCL
array before calling the exact same OpenCL program.

First, the data-loading that seems to work on every driver:

   # src is a sliced view on a larger Numpy array
   src = src.astype(float32, copy=True)
   src_dev = cl_array.to_device(clq, src)

This version makes an explicit host copy to consolidate the source
data into one contiguous buffer.  Without that copy, the to_device()
call throws an exception when handed the non-contiguous sliced view.
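For what it's worth, here is a minimal NumPy-only illustration of why
the explicit copy is needed (the array shapes are invented for the
example; my real data is larger):

```python
import numpy as np

# A sliced view into a larger array is generally not contiguous,
# which is why to_device() rejects it without an explicit copy.
big = np.zeros((8, 8, 8), dtype=np.float32)
src = big[1:5, 1:5, 1:5]                    # strided view
assert not src.flags['C_CONTIGUOUS']

# astype(..., copy=True) allocates a fresh, contiguous buffer.
consolidated = src.astype(np.float32, copy=True)
assert consolidated.flags['C_CONTIGUOUS']
```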

Second, the version that runs faster on the Intel and AMD CPU drivers
but produces non-deterministic results on the NVIDIA GPU driver:

   # src is a sliced view on a larger NumPy array
   src = src.astype(float32, copy=False)   # no copy; src is already float32
   src_dev = cl_array.empty(clq, src.shape, float32)
   src_tmp = src_dev.map_to_host()         # host-mapped view of the device buffer
   src_tmp[...] = src[...]                 # strided copy consolidates the data

This version avoids the extra host copy because the original source is
already in float32 format; instead, the data is consolidated as it is
copied into the host-mapped buffer.
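As a sanity check on the copy=False semantics I am relying on, this
NumPy-only sketch (again with invented shapes) shows that no host copy
is made and that the strided assignment consolidates the data:

```python
import numpy as np

big = np.zeros((8, 8, 8), dtype=np.float32)
src = big[1:5, 1:5, 1:5]                    # non-contiguous float32 view

# astype(..., copy=False) returns the view itself when the dtype
# already matches, so no host-side copy is made here.
same = src.astype(np.float32, copy=False)
assert same is src

# Assigning into a contiguous destination consolidates the data,
# mirroring the assignment into the host-mapped buffer above.
dst = np.empty(src.shape, dtype=np.float32)
dst[...] = src[...]
assert dst.flags['C_CONTIGUOUS']
```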

My tests suggest that my OpenCL program is racing with the host-to-GPU
data copy: it appears to see the leading portion of the src_dev array
filled with proper values while the trailing portion looks
uninitialized.  Through repeated tests with varying problem sizes, I
managed to reproduce this on some rather small test arrays that I
could inspect manually.

Is there some other synchronization call that I am supposed to make
when writing into a host-mapped array as above?  Or does this look
like a bug in the interaction between PyOpenCL and the NVIDIA OpenCL
driver?

Thanks,

Karl


_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
