I am experimenting with convolution kernels for 3D arrays. I have a scalar version that seems to work, but I ran into problems while trying to reduce extra data copies. I isolated the change in my Python code to two versions that get the source NumPy array into a PyOpenCL array before calling the exact same OpenCL program.
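To illustrate what I mean by a sliced view needing consolidation, here is a minimal NumPy-only sketch (the array names and shapes here are made up for illustration):

```python
import numpy as np

big = np.zeros((8, 8, 8), dtype=np.float32)
src = big[1:5, 1:5, 1:5]            # sliced view: strided, not contiguous
assert not src.flags['C_CONTIGUOUS']

# astype(..., copy=True) always yields a fresh contiguous buffer
consolidated = src.astype(np.float32, copy=True)
assert consolidated.flags['C_CONTIGUOUS']

# astype(..., copy=False) returns the view itself when the dtype
# already matches -- no copy is made, but it is still non-contiguous
same = src.astype(np.float32, copy=False)
assert same is src
```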
First, the data-loading version that seems to work on every driver:

    # src is a sliced view on a larger Numpy array
    src = src.astype(float32, copy=True)
    src_dev = cl_array.to_device(clq, src)

This version makes an explicit host copy to consolidate the source data into one contiguous buffer; to_device() would otherwise raise an exception when handed the non-contiguous view.

Second, the version that runs faster on the Intel and AMD CPU drivers but is non-deterministic on the NVIDIA GPU driver:

    # src is a sliced view on a larger Numpy array
    src = src.astype(float32, copy=False)
    src_dev = cl_array.empty(clq, src.shape, float32)
    src_tmp = src_dev.map_to_host()
    src_tmp[...] = src[...]

This version avoids the extra host copy, since the source is already float32; the data is consolidated as it is written into the host-mapped buffer.

My tests suggest that my OpenCL program is racing with the host-to-GPU data copy: the kernel sees the leading portion of the src_dev array filled with proper values while the trailing portion looks uninitialized. By repeating the tests with varying problem sizes, I managed to reproduce this on arrays small enough to inspect by hand.

Is there some other synchronization call I am supposed to make when writing into a host-mapped array as above? Or does this look like a bug in the interaction between PyOpenCL and the NVIDIA OpenCL driver?

Thanks,
Karl

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
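P.S. For reference, here is a sketch of the explicit-unmap pattern I am now considering. The MemoryMap release() and finish() calls are my guess at the synchronization that might be required before a kernel touches the buffer; clq and the cl_array module are as in my snippets above, and the helper name is invented:

```python
import numpy as np

def upload_via_map(clq, cl_array_mod, src):
    """Sketch: copy a float32 view into a device array through a
    host-mapped buffer, then explicitly unmap and synchronize before
    any kernel uses it. Assumes src.dtype is already float32."""
    src_dev = cl_array_mod.empty(clq, src.shape, np.float32)
    src_tmp = src_dev.map_to_host()      # host-visible view of the buffer
    src_tmp[...] = src[...]              # consolidate the strided view
    src_tmp.base.release(clq)            # enqueue the unmap explicitly
    clq.finish()                         # wait until the unmap completes
    return src_dev
```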
