Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling

2009-06-15 Thread Andreas Klöckner
On Monday 15 June 2009, you wrote:
> I've attached a slightly
> cleaned up, standalone version with NVIDIA's copyright notice
> restored.

I'm assuming you meant for this to end up in PyCUDA's examples folder. That's 
where it is now, in any case. :) Let me know if that wasn't the intention.

Thanks for the contribution,
Andreas




Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling

2009-06-14 Thread Andreas Klöckner
On Saturday 13 June 2009, Andrew Wagner wrote:
> Thanks again, Nicolas!  I suspected that was the case...

> If this is still the case in the latest pycuda, the documentation for
> get_global() at
>
> http://documen.tician.de/pycuda/driver.html#code-on-the-device-modules-and-functions
>
> needs to be corrected.

Fixed.


> I have attached a short demo for using constant memory that does
> this, and seems to work at least for one easy case.  I release it
> under whatever license the other pycuda demo scripts are under, so it
> can be included in the next release if Andreas sees fit.
>
> Unfortunately, this means something harder to debug is causing my
> garbage results...

Now in test_driver.
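
For the list archives, the round trip the test exercises boils down to roughly
the sketch below. The trivial kernel and all names here are invented for
illustration, and SourceModule is imported from pycuda.compiler, where it
lives in current PyCUDA:

import numpy
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Compile a kernel that just copies a __constant__ array out to global memory.
mod = SourceModule("""
__device__ __constant__ float const_coeffs[4];

__global__ void copy_consts(float *dest)
{
    dest[threadIdx.x] = const_coeffs[threadIdx.x];
}
""")

coeffs = numpy.arange(4, dtype=numpy.float32)

# get_global() returns (device address, size in bytes) for the symbol;
# memcpy_htod() wants just the address.
ptr, size_in_bytes = mod.get_global("const_coeffs")
cuda.memcpy_htod(ptr, coeffs)

dest = numpy.empty_like(coeffs)
dest_gpu = cuda.mem_alloc(dest.nbytes)
mod.get_function("copy_consts")(dest_gpu, block=(4, 1, 1))
cuda.memcpy_dtoh(dest, dest_gpu)
assert numpy.allclose(dest, coeffs)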

Thanks!
Andreas




Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling

2009-06-13 Thread Andrew Wagner
On Sat, Jun 13, 2009 at 6:20 PM, Nicolas Pinto pi...@mit.edu wrote:
> Andrew,
>
> The following patch should make it work. PyCuda kernel functions take
> numpy.int32() arguments, whereas the grid dimensions should be plain int().

Thanks a lot, Nicolas!  That got the kernel at least running.  I'm
still getting garbage output, and I think it may be because my filter
kernel (filterx) is not making it into constant memory (under the
identifier d_Kernel_rows).
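
For concreteness, that calling convention comes out roughly like the sketch
below; the argument names follow the attached convolution.py, but the block
and grid shapes are my guesses rather than the exact values in the script:

convolutionRowGPU = module.get_function('convolutionRowGPU')
convolutionRowGPU(intermediateImage_gpu, sourceImage_gpu,
                  numpy.int32(DATA_W), numpy.int32(DATA_H),  # scalar kernel args: numpy.int32
                  block=(KERNEL_RADIUS_ALIGNED + ROW_TILE_W + KERNEL_RADIUS, 1, 1),
                  grid=(int(blockGridRows[0]), int(blockGridRows[1])))  # grid dims: plain int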

> Also, pycuda.Driver.Module.get_global seems to return a length-2
> tuple, while pycuda.Driver.memcpy_htod expects the reference to be an
> integer.  I got past this error by pulling out the first entry of the
> tuple, which seems like the address, but I'm not sure if this is
> correct.  This is for transferring the convolution kernel (the filter
> parameters, not the cuda kernel) into constant memory.

The declaration of the constant array is in the kernel source at line
29 of convolution.py:

__device__ __constant__ float d_Kernel_rows[KERNEL_W];

I get the address for the symbol d_Kernel_rows at line 231:

d_Kernel_rows = module.get_global('d_Kernel_rows')

I try to upload data to the array on line 327:

cuda.memcpy_htod(d_Kernel_rows, filterx)  # The kernel goes into constant memory via a symbol defined in the kernel

I get the following error:

The debugged program raised the exception ArgumentError:

Python argument types in
    pycuda._driver.memcpy_htod(tuple, numpy.ndarray)
did not match C++ signature:
    memcpy_htod(unsigned int dest, boost::python::api::object src,
                boost::python::api::object stream=None)

Here are some of the relevant variables from the debugger...

>>> d_Kernel_rows
(16778496, 68)
>>> type(d_Kernel_rows[0])
<type 'int'>
>>> type(d_Kernel_rows[1])
<type 'int'>
>>> filterx
array([ 0.01396019,  0.02230832,  0.03348875,  0.04722672,  0.06256524,
        0.07786369,  0.09103188,  0.09997895,  0.10315263,  0.09997895,
        0.09103188,  0.07786369,  0.06256524,  0.04722672,  0.03348875,
        0.02230832,  0.01396019], dtype=float32)
>>> filterx.shape
(17,)
>>> KERNEL_W
17
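
So the tuple looks like a (device address, size in bytes) pair: 68 is exactly
17 float32s. Presumably the copy should then use only the address, along
these lines:

d_Kernel_rows_ptr, d_Kernel_rows_size = module.get_global('d_Kernel_rows')
assert d_Kernel_rows_size == filterx.nbytes  # 17 * 4 == 68 bytes
cuda.memcpy_htod(d_Kernel_rows_ptr, filterx)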

Again, I have attached a stand-alone version of the code.

Thanks!

convolution.py:
import numpy
#from helper_functions import *
#from plotting import *
import pycuda.autoinit
import pycuda.driver as cuda
import time
import string
# from database import imread,  imsave,  imshow

# Pull out a bunch of stuff that was hard coded as pre-processor directives used by both the kernel and calling code.
KERNEL_RADIUS = 8
UNROLL_INNER_LOOP = False
KERNEL_W = 2 * KERNEL_RADIUS + 1
ROW_TILE_W = 128
KERNEL_RADIUS_ALIGNED = 16
COLUMN_TILE_W = 16
COLUMN_TILE_H = 48
template = '''
//24-bit multiplication is faster on G80,
//but we must be sure to multiply integers
//only within [-8M, 8M - 1] range
#define IMUL(a, b) __mul24(a, b)


// Kernel configuration

#define KERNEL_RADIUS $KERNEL_RADIUS
#define KERNEL_W $KERNEL_W
__device__ __constant__ float d_Kernel_rows[KERNEL_W];
__device__ __constant__ float d_Kernel_columns[KERNEL_W];

// Assuming ROW_TILE_W, KERNEL_RADIUS_ALIGNED and dataW 
// are multiples of coalescing granularity size,
// all global memory operations are coalesced in convolutionRowGPU()
#define ROW_TILE_W $ROW_TILE_W
#define KERNEL_RADIUS_ALIGNED  $KERNEL_RADIUS_ALIGNED

// Assuming COLUMN_TILE_W and dataW are multiples
// of coalescing granularity size, all global memory operations 
// are coalesced in convolutionColumnGPU()
#define COLUMN_TILE_W $COLUMN_TILE_W
#define COLUMN_TILE_H $COLUMN_TILE_H'''

# Ignore the ugly templated unrolling code...
'''

// Loop unrolling templates, needed for best performance

template<int i> __device__ float convolutionRow(float *data){
    return
        data[KERNEL_RADIUS - i] * d_Kernel[i]
        + convolutionRow<i - 1>(data);
}

template<> __device__ float convolutionRow<-1>(float *data){
    return 0;
}

template<int i> __device__ float convolutionColumn(float *data){
    return
        data[(KERNEL_RADIUS - i) * COLUMN_TILE_W] * d_Kernel[i]
        + convolutionColumn<i - 1>(data);
}

template<> __device__ float convolutionColumn<-1>(float *data){
    return 0;
}'''

template += '''

// Row convolution filter

__global__ void convolutionRowGPU(
float *d_Result,
float *d_Data,
int dataW,
int dataH
){
//Data cache
__shared__ float data[KERNEL_RADIUS + ROW_TILE_W + KERNEL_RADIUS];

//Current tile and apron limits, relative to row start
const int tileStart = IMUL(blockIdx.x, ROW_TILE_W);
const int   tileEnd = tileStart + ROW_TILE_W - 1;
const int apronStart = tileStart - 

Re: [PyCUDA] Porting nvidia's separable convolution example to pycuda: C++ templates, loop unrolling

2009-06-13 Thread Nicolas Pinto
Andrew,

memcpy_htod is expecting a uint, not a tuple:

--- convolution_original.py 2009-06-13 23:12:49.000000000 -0400
+++ convolution_new.py  2009-06-13 23:16:37.000000000 -0400
@@ -324,8 +324,8 @@
 sourceImage_gpu = cuda.mem_alloc_like(sourceImage)
 intermediateImage_gpu = cuda.mem_alloc_like(sourceImage)
 cuda.memcpy_htod(sourceImage_gpu, sourceImage)
-cuda.memcpy_htod(d_Kernel_rows,  filterx) # The kernel goes into constant memory via a symbol defined in the kernel
-cuda.memcpy_htod(d_Kernel_columns,  filtery)
+cuda.memcpy_htod(d_Kernel_rows[0],  filterx) # The kernel goes into constant memory via a symbol defined in the kernel
+cuda.memcpy_htod(d_Kernel_columns[0],  filtery)
 # Call the kernels for convolution in each direction.
 blockGridRows = (iDivUp(DATA_W, ROW_TILE_W), DATA_H)
 blockGridColumns = (iDivUp(DATA_W, COLUMN_TILE_W), iDivUp(DATA_H, COLUMN_TILE_H))
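
If you want to rule out the upload itself as the source of the garbage, you
can also read the symbol back from constant memory and compare, e.g. (a
sketch using the names from your convolution.py):

check = numpy.empty_like(filterx)
cuda.memcpy_dtoh(check, d_Kernel_rows[0])  # same address the patch copies to
assert numpy.allclose(check, filterx)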

Best,

On Sat, Jun 13, 2009 at 10:16 PM, Andrew Wagner drewm1...@gmail.com wrote:

> On Sat, Jun 13, 2009 at 6:20 PM, Nicolas Pinto pi...@mit.edu wrote:
> > Andrew,
> >
> > The following patch should make it work. PyCuda kernel functions take
> > numpy.int32() arguments, whereas the grid dimensions should be plain int().
>
> Thanks a lot, Nicolas!  That got the kernel at least running.  I'm
> still getting garbage output, and I think it may be because my filter
> kernel (filterx) is not making it into constant memory (under the
> identifier d_Kernel_rows).
>
> > Also, pycuda.Driver.Module.get_global seems to return a length-2
> > tuple, while pycuda.Driver.memcpy_htod expects the reference to be an
> > integer.  I got past this error by pulling out the first entry of the
> > tuple, which seems like the address, but I'm not sure if this is
> > correct.  This is for transferring the convolution kernel (the filter
> > parameters, not the cuda kernel) into constant memory.
>
> The declaration of the constant array is in the kernel source at line
> 29 of convolution.py:
>
> __device__ __constant__ float d_Kernel_rows[KERNEL_W];
>
> I get the address for the symbol d_Kernel_rows at line 231:
>
> d_Kernel_rows = module.get_global('d_Kernel_rows')
>
> I try to upload data to the array on line 327:
>
> cuda.memcpy_htod(d_Kernel_rows, filterx)  # The kernel goes into constant memory via a symbol defined in the kernel
>
> I get the following error:
>
> The debugged program raised the exception ArgumentError:
>
> Python argument types in
>     pycuda._driver.memcpy_htod(tuple, numpy.ndarray)
> did not match C++ signature:
>     memcpy_htod(unsigned int dest, boost::python::api::object src,
>                 boost::python::api::object stream=None)
>
> Here are some of the relevant variables from the debugger...
>
> >>> d_Kernel_rows
> (16778496, 68)
> >>> type(d_Kernel_rows[0])
> <type 'int'>
> >>> type(d_Kernel_rows[1])
> <type 'int'>
> >>> filterx
> array([ 0.01396019,  0.02230832,  0.03348875,  0.04722672,  0.06256524,
>         0.07786369,  0.09103188,  0.09997895,  0.10315263,  0.09997895,
>         0.09103188,  0.07786369,  0.06256524,  0.04722672,  0.03348875,
>         0.02230832,  0.01396019], dtype=float32)
> >>> filterx.shape
> (17,)
> >>> KERNEL_W
> 17
>
> Again, I have attached a stand-alone version of the code.
>
> Thanks!





-- 
Nicolas Pinto
Ph.D. Candidate, Brain & Computer Sciences
Massachusetts Institute of Technology, USA
http://web.mit.edu/pinto