Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling
On Monday 15 June 2009, you wrote:
> I've attached a slightly cleaned up, standalone version with NVIDIA's
> copyright notice restored.

I'm assuming you meant for this to end up in PyCUDA's examples folder. That's where it is now, in any case. :) Let me know if that wasn't the intention.

Thanks for the contribution,
Andreas
Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling
On Saturday 13 June 2009, Andrew Wagner wrote:
> Thanks again, Nicolas! I suspected that was the case... If this is still
> the case in the latest pycuda, the documentation for get_global() at
> http://documen.tician.de/pycuda/driver.html#code-on-the-device-modules-and-functions
> needs to be corrected.

Fixed.

> I have attached a short demo for using constant memory that does this, and
> it seems to work at least for one easy case. I release it under whatever
> license the other pycuda demo scripts are under, so it can be included in
> the next release if Andreas sees fit. Unfortunately, this means something
> harder to debug is causing my garbage results...

Now in test_driver. Thanks!

Andreas
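For reference, a minimal sketch of that constant-memory pattern, using pycuda.compiler.SourceModule as in current PyCUDA; the symbol name and data below are hypothetical, and this is not the attached demo itself:

import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__constant__ float coeffs[4];

__global__ void scale(float *out, float *in)
{
    int i = threadIdx.x;
    out[i] = coeffs[i % 4] * in[i];
}
""")

# get_global() returns a (device_pointer, size_in_bytes) tuple;
# memcpy_htod() takes only the pointer.
coeffs_ptr, coeffs_size = mod.get_global('coeffs')
cuda.memcpy_htod(coeffs_ptr, numpy.array([1, 2, 3, 4], dtype=numpy.float32))

src = numpy.arange(16, dtype=numpy.float32)
dest = numpy.empty_like(src)
mod.get_function('scale')(cuda.Out(dest), cuda.In(src),
                          block=(16, 1, 1), grid=(1, 1))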
Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling
On Sat, Jun 13, 2009 at 6:20 PM, Nicolas Pinto <pi...@mit.edu> wrote:
> Andrew,
> The following patch should make it work. PyCuda kernel functions take
> numpy.int32() whereas the grid should be int().

Thanks a lot, Nicolas! That got the kernel at least running.

I'm still getting garbage output, and I think it may be because my filter kernel (filterx) is not making it into constant memory (under the identifier d_Kernel_rows).

Also, pycuda.Driver.Module.get_global seems to return a length-2 tuple, while pycuda.Driver.memcpy_htod expects the reference to be an integer. I got past this error by pulling out the first entry of the tuple, which seems to be the address, but I'm not sure if this is correct. This is for transferring the convolution kernel (the filter parameters, not the CUDA kernel) into constant memory.

The declaration of the constant array is in the kernel source at line 29 of convolution.py:

__device__ __constant__ float d_Kernel_rows[KERNEL_W];

I get the address for the symbol d_Kernel_rows at line 231:

d_Kernel_rows = module.get_global('d_Kernel_rows')

I try to upload data to the array on line 327:

cuda.memcpy_htod(d_Kernel_rows, filterx) # The kernel goes into constant memory via a symbol defined in the kernel

I get the following error:

The debugged program raised the exception ArgumentError
Python argument types in
    pycuda._driver.memcpy_htod(tuple, numpy.ndarray)
did not match C++ signature:
    memcpy_htod(unsigned int dest, boost::python::api::object src,
                boost::python::api::object stream=None)

Here are some of the relevant variables from the debugger...

d_Kernel_rows
(16778496, 68)
type(d_Kernel_rows[0])
<type 'int'>
type(d_Kernel_rows[1])
<type 'int'>
filterx
array([ 0.01396019,  0.02230832,  0.03348875,  0.04722672,  0.06256524,
        0.07786369,  0.09103188,  0.09997895,  0.10315263,  0.09997895,
        0.09103188,  0.07786369,  0.06256524,  0.04722672,  0.03348875,
        0.02230832,  0.01396019], dtype=float32)
filterx.shape
(17,)
KERNEL_W
17

Again, I have attached a stand-alone version of the code. Thanks!

import numpy
#from helper_functions import *
#from plotting import *
import pycuda.autoinit
import pycuda.driver as cuda
import time
import string
# from database import imread, imsave, imshow

# Pull out a bunch of stuff that was hard coded as pre-processor
# directives used by both the kernel and calling code.
KERNEL_RADIUS = 8
UNROLL_INNER_LOOP = False
KERNEL_W = 2 * KERNEL_RADIUS + 1
ROW_TILE_W = 128
KERNEL_RADIUS_ALIGNED = 16
COLUMN_TILE_W = 16
COLUMN_TILE_H = 48

template = '''
//24-bit multiplication is faster on G80,
//but we must be sure to multiply integers
//only within [-8M, 8M - 1] range
#define IMUL(a, b) __mul24(a, b)

// Kernel configuration
#define KERNEL_RADIUS $KERNEL_RADIUS
#define KERNEL_W $KERNEL_W

__device__ __constant__ float d_Kernel_rows[KERNEL_W];
__device__ __constant__ float d_Kernel_columns[KERNEL_W];

// Assuming ROW_TILE_W, KERNEL_RADIUS_ALIGNED and dataW
// are multiples of coalescing granularity size,
// all global memory operations are coalesced in convolutionRowGPU()
#define ROW_TILE_W $ROW_TILE_W
#define KERNEL_RADIUS_ALIGNED $KERNEL_RADIUS_ALIGNED

// Assuming COLUMN_TILE_W and dataW are multiples
// of coalescing granularity size, all global memory operations
// are coalesced in convolutionColumnGPU()
#define COLUMN_TILE_W $COLUMN_TILE_W
#define COLUMN_TILE_H $COLUMN_TILE_H'''

# Ignore the ugly templated unrolling code...
'''
// Loop unrolling templates, needed for best performance
template<int i> __device__ float convolutionRow(float *data){
    return data[KERNEL_RADIUS - i] * d_Kernel[i] + convolutionRow<i - 1>(data);
}

template<> __device__ float convolutionRow<-1>(float *data){
    return 0;
}

template<int i> __device__ float convolutionColumn(float *data){
    return data[(KERNEL_RADIUS - i) * COLUMN_TILE_W] * d_Kernel[i]
         + convolutionColumn<i - 1>(data);
}

template<> __device__ float convolutionColumn<-1>(float *data){
    return 0;
}'''

template += '''
// Row convolution filter
__global__ void convolutionRowGPU(
    float *d_Result,
    float *d_Data,
    int dataW,
    int dataH
){
    //Data cache
    __shared__ float data[KERNEL_RADIUS + ROW_TILE_W + KERNEL_RADIUS];

    //Current tile and apron limits, relative to row start
    const int tileStart = IMUL(blockIdx.x, ROW_TILE_W);
    const int tileEnd = tileStart + ROW_TILE_W - 1;
    const int apronStart = tileStart -
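The attachment breaks off above, but the mechanism it relies on is visible: the $-placeholders in the template string are filled in on the Python side before compilation. A hedged sketch of that step (the names mirror the script's constants, but this is illustrative, not Andrew's actual code); it also shows how the unrolled inner loop could be generated as plain text instead of via recursive C++ templates:

import string

KERNEL_RADIUS = 8
KERNEL_W = 2 * KERNEL_RADIUS + 1

# Fill the $-placeholders before handing the source to SourceModule;
# this is what the 'import string' in the script is for.
header = string.Template('''
#define KERNEL_RADIUS $KERNEL_RADIUS
#define KERNEL_W $KERNEL_W
''').substitute(KERNEL_RADIUS=KERNEL_RADIUS, KERNEL_W=KERNEL_W)

# An alternative to the recursive C++ templates: generate the unrolled
# convolution sum as text on the Python side.
unrolled_sum = " +\n        ".join(
    "data[KERNEL_RADIUS - %d] * d_Kernel_rows[%d]" % (i, i)
    for i in range(KERNEL_W))
body = "float sum =\n        %s;" % unrolled_sum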
Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling
Andrew,

memcpy_htod is expecting a uint, not a tuple:

--- convolution_original.py	2009-06-13 23:12:49.0 -0400
+++ convolution_new.py	2009-06-13 23:16:37.0 -0400
@@ -324,8 +324,8 @@
 sourceImage_gpu = cuda.mem_alloc_like(sourceImage)
 intermediateImage_gpu = cuda.mem_alloc_like(sourceImage)
 cuda.memcpy_htod(sourceImage_gpu, sourceImage)
-cuda.memcpy_htod(d_Kernel_rows, filterx) # The kernel goes into constant memory via a symbol defined in the kernel
-cuda.memcpy_htod(d_Kernel_columns, filtery)
+cuda.memcpy_htod(d_Kernel_rows[0], filterx) # The kernel goes into constant memory via a symbol defined in the kernel
+cuda.memcpy_htod(d_Kernel_columns[0], filtery)
 # Call the kernels for convolution in each direction.
 blockGridRows = (iDivUp(DATA_W, ROW_TILE_W), DATA_H)
 blockGridColumns = (iDivUp(DATA_W, COLUMN_TILE_W), iDivUp(DATA_H, COLUMN_TILE_H))

Best,

On Sat, Jun 13, 2009 at 10:16 PM, Andrew Wagner <drewm1...@gmail.com> wrote:
> [...]

--
Nicolas Pinto
Ph.D. Candidate, Brain Computer Sciences
Massachusetts Institute of Technology, USA
http://web.mit.edu/pinto
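Putting the two fixes from this thread together — the tuple returned by get_global() and the numpy.int32/int argument convention — a hedged sketch of the corrected call pattern follows. It is a fragment, not a runnable script: it assumes the names from Andrew's script (module, filterx, filtery, sourceImage_gpu, DATA_W, DATA_H, iDivUp, and the tile constants) are in scope, and intermediateImage_gpu is assumed to receive the row-pass output:

# Fix 1: get_global() returns (device_pointer, size_in_bytes);
# memcpy_htod() takes only the pointer.
rows_ptr, rows_size = module.get_global('d_Kernel_rows')
cols_ptr, cols_size = module.get_global('d_Kernel_columns')
cuda.memcpy_htod(rows_ptr, filterx)
cuda.memcpy_htod(cols_ptr, filtery)

# Fix 2: scalar kernel arguments are passed as numpy.int32, while grid
# (and block) dimensions must be plain Python ints.
convolutionRowGPU = module.get_function('convolutionRowGPU')
convolutionRowGPU(intermediateImage_gpu, sourceImage_gpu,
                  numpy.int32(DATA_W), numpy.int32(DATA_H),
                  grid=(int(iDivUp(DATA_W, ROW_TILE_W)), int(DATA_H)),
                  block=(KERNEL_RADIUS_ALIGNED + ROW_TILE_W + KERNEL_RADIUS, 1, 1))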