Re: [PyCUDA] PyCuda Memory Question

Aaron Benjamin Greenblatt Tue, 03 Nov 2009 18:07:19 -0800

Well, that's not helpful. I didn't paste the output with the C source included. 
Here it is:
x:
[[ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]]
y
[[  0.   0.  NaN   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]]
ydes
[[ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]]
weightsL1
[[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]]
L1preadd
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]]
L1s
[  0.   0.  Inf  Inf]
L1xout
[ 0.  0.  0.  0.]
weightsL2
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
L2preadd
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
L2s
[ 0.  0.  0.  0.]
L2xout
[ 0.  0.  0.  0.]
weightsL3
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
L3preadd
[[  0.   0.   0.   0.]
 [  0.  Inf   0.   0.]
 [  0.   0.   0.   0.]
 [  0.   0.   0.   0.]]
L3s
[ 0.  0.  0.  0.]
L3xout
[ 0.  0.  0.  0.]



----- Original Message -----
From: "Aaron Greenblatt" <[email protected]>
To: [email protected]
Sent: Tuesday, November 3, 2009 11:05:52 AM GMT -08:00 US/Canada Pacific
Subject: [PyCUDA] PyCuda Memory Question

Hi,

I'm new to Python but have coded stuff in C / CUDA before.

I am trying to copy some variables from Python / Numpy to a GPU, and then back
to the host again. When I get the stuff back from the GPU, I appear to get a few
random NaN's and Inf values - I'm confused as to why these are happening. I have
a few C source modules in the Python script, and, when I remove them, some of
the Inf's go away. This confuses me even more, as I never even called the
functions in the C source modules, so removing them shouldn't make a difference.
(Or am I missing something there too?) 
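
One thing that might be worth ruling out first (an assumption; nothing in this thread confirms it): numpy.empty and numpy.empty_like return uninitialized float64 memory, and the scripts below use them for exactly the arrays that show NaN and Inf (y, the L*s vectors, the L*PreAdd matrices). If that is the cause, the GPU is just handing back whatever was already in host memory, and loading a SourceModule merely changes which leftover bytes happen to be there. The sketch below uses only the standard library, in Python 3 syntax rather than the Python 2 of the scripts, to show how leftover 64-bit patterns can narrow to 0, Inf or NaN as float32. The helper name and the bit patterns are made up for illustration.

```python
import math
import struct

def float64_bits_to_float32(bits):
    """Decode a 64-bit pattern as float64, then narrow it to float32.

    Hypothetical helper: it imitates what astype(numpy.float32) would do
    to a single element of an uninitialized float64 array.
    """
    value = struct.unpack("<d", struct.pack("<Q", bits))[0]
    try:
        # Round-trip through 4 bytes to get the nearest float32 value.
        return struct.unpack("<f", struct.pack("<f", value))[0]
    except OverflowError:
        # Finite float64 values beyond the float32 range narrow to +/-Inf.
        return math.copysign(math.inf, value)

# Made-up bit patterns standing in for leftover bytes in memory.
leftover = [
    0x0000000000000000,  # all zero bits
    0x0000000000000001,  # smallest subnormal float64
    0x7FE0000000000000,  # huge but finite float64 (2**1023)
    0x7FF8000000000000,  # quiet NaN
]
narrowed = [float64_bits_to_float32(b) for b in leftover]
print(narrowed)
```

If this is what is happening, creating the work arrays with numpy.zeros instead of numpy.empty should make the stray values disappear, with or without the C source modules.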

It almost seems like the system / video driver is overwriting the memory that I
write on the video card. Is this a possibility and, if so, how does one deal
with it in PyCuda? (I haven't run into this issue when working on C / CUDA
before, but my dataset was also pretty small). I'm going to look through
nVidia's CUDA programming guide again to make sure that I'm not missing
something obvious.

Also, I know that I need to optimize the code in the C modules - for now I just
want to get something working, and then I'll write C code that uses the hardware
better.

I've attached source code and output with and without the C source modules.

Does anyone have thoughts as to what's going on here? Thanks for your help!

Aaron


**** Script  without C source ***

# Sample source code from the Tutorial Introduction in the documentation.

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy

x = numpy.ones([4,5]) * .01
ydes = x
y = numpy.empty_like(x)
L1neurons = 4
L2neurons = 4
L3neurons = 4
L1weightsPerNeuron = x.size
L2weightsPerNeuron = L1neurons
L3weightsPerNeuron = L2neurons
weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])
L1s = numpy.empty([L1neurons])
L2s = numpy.empty([L2neurons])
L3s = numpy.empty([L3neurons])
L1xout = numpy.empty_like(L1s)
L1PreAdd = numpy.empty_like(weightsL1)
L2xout = numpy.empty_like(L2s)
L2PreAdd = numpy.empty_like(weightsL2)
L3xout = numpy.empty_like(L3s)
L3PreAdd = numpy.empty_like(weightsL3)

# convert these variables to float singles for GPU use
x = x.astype(numpy.float32)
ydes = ydes.astype(numpy.float32)
y = y.astype(numpy.float32)
weightsL1 = weightsL1.astype(numpy.float32)
weightsL2 = weightsL2.astype(numpy.float32)
weightsL3 = weightsL3.astype(numpy.float32)
L1s = L1s.astype(numpy.float32)
L2s = L2s.astype(numpy.float32)
L3s = L3s.astype(numpy.float32)
L1PreAdd = L1PreAdd.astype(numpy.float32)
L1xout = L1xout.astype(numpy.float32)
L2PreAdd = L2PreAdd.astype(numpy.float32)
L2xout = L2xout.astype(numpy.float32)
L3PreAdd = L3PreAdd.astype(numpy.float32)
L3xout = L3xout.astype(numpy.float32)

# allocate GPU memory
GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)

# copy variables to GPU
cuda.memcpy_htod(GPUx, x)
cuda.memcpy_htod(GPUydes, ydes)
cuda.memcpy_htod(GPUy, y)
cuda.memcpy_htod(GPUweightsL1, weightsL1)
cuda.memcpy_htod(GPUweightsL2, weightsL2)
cuda.memcpy_htod(GPUweightsL3, weightsL3)
cuda.memcpy_htod(GPUL1s, L1s)
cuda.memcpy_htod(GPUL2s, L2s)
cuda.memcpy_htod(GPUL3s, L3s)
cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
cuda.memcpy_htod(GPUL1xout, L1xout)
cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
cuda.memcpy_htod(GPUL2xout, L2xout)
cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
cuda.memcpy_htod(GPUL3xout, L3xout)

# Print stuff
cuda.memcpy_dtoh(x, GPUx)
cuda.memcpy_dtoh(ydes, GPUydes)
cuda.memcpy_dtoh(y, GPUy)
cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
cuda.memcpy_dtoh(weightsL3, GPUweightsL3)

cuda.memcpy_dtoh(L1s, GPUL1s)
cuda.memcpy_dtoh(L2s, GPUL2s)
cuda.memcpy_dtoh(L3s, GPUL3s)
cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
cuda.memcpy_dtoh(L1xout, GPUL1xout)
cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
cuda.memcpy_dtoh(L2xout, GPUL2xout)
cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
cuda.memcpy_dtoh(L3xout, GPUL3xout)
print "x:"
print x
print "y"
print y
print "ydes"
print ydes
print "weightsL1"
print weightsL1
print "L1preadd"
print L1PreAdd
print "L1s"
print L1s
print "L1xout"
print L1xout
print "weightsL2"
print weightsL2
print "L2preadd"
print L2PreAdd
print "L2s"
print L2s
print "L2xout"
print L2xout
print "weightsL3"
print weightsL3
print "L3preadd"
print L3PreAdd
print "L3s"
print L3s
print "L3xout"
print L3xout

****** Output without C source *****

x:
[[ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]]
y
[[  0.   0.  NaN   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]]
ydes
[[ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]
 [ 0.01  0.01  0.01  0.01  0.01]]
weightsL1
[[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
   1.  1.]]
L1preadd
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.]]
L1s
[ 0.  0.  0.  0.]
L1xout
[ 0.  0.  0.  0.]
weightsL2
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
L2preadd
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
L2s
[ 0.  0.  0.  0.]
L2xout
[ 0.  0.  0.  0.]
weightsL3
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
L3preadd
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
L3s
[ 0.  0.  0.  0.]
L3xout
[ 0.  0.  0.  0.]


******* Script with C Source ***************

# Sample source code from the Tutorial Introduction in the documentation.

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy

x = numpy.ones([4,5]) * .01
ydes = x
y = numpy.empty_like(x)
L1neurons = 4
L2neurons = 4
L3neurons = 4
L1weightsPerNeuron = x.size
L2weightsPerNeuron = L1neurons
L3weightsPerNeuron = L2neurons
weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])
L1s = numpy.empty([L1neurons])
L2s = numpy.empty([L2neurons])
L3s = numpy.empty([L3neurons])
L1xout = numpy.empty_like(L1s)
L1PreAdd = numpy.empty_like(weightsL1)
L2xout = numpy.empty_like(L2s)
L2PreAdd = numpy.empty_like(weightsL2)
L3xout = numpy.empty_like(L3s)
L3PreAdd = numpy.empty_like(weightsL3)

# convert these variables to float singles for GPU use
x = x.astype(numpy.float32)
ydes = ydes.astype(numpy.float32)
y = y.astype(numpy.float32)
weightsL1 = weightsL1.astype(numpy.float32)
weightsL2 = weightsL2.astype(numpy.float32)
weightsL3 = weightsL3.astype(numpy.float32)
L1s = L1s.astype(numpy.float32)
L2s = L2s.astype(numpy.float32)
L3s = L3s.astype(numpy.float32)
L1PreAdd = L1PreAdd.astype(numpy.float32)
L1xout = L1xout.astype(numpy.float32)
L2PreAdd = L2PreAdd.astype(numpy.float32)
L2xout = L2xout.astype(numpy.float32)
L3PreAdd = L3PreAdd.astype(numpy.float32)
L3xout = L3xout.astype(numpy.float32)

# allocate GPU memory
GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)

# copy variables to GPU
cuda.memcpy_htod(GPUx, x)
cuda.memcpy_htod(GPUydes, ydes)
cuda.memcpy_htod(GPUy, y)
cuda.memcpy_htod(GPUweightsL1, weightsL1)
cuda.memcpy_htod(GPUweightsL2, weightsL2)
cuda.memcpy_htod(GPUweightsL3, weightsL3)
cuda.memcpy_htod(GPUL1s, L1s)
cuda.memcpy_htod(GPUL2s, L2s)
cuda.memcpy_htod(GPUL3s, L3s)
cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
cuda.memcpy_htod(GPUL1xout, L1xout)
cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
cuda.memcpy_htod(GPUL2xout, L2xout)
cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
cuda.memcpy_htod(GPUL3xout, L3xout)

# C source code for stuff we do on GPU
ForwardMult = SourceModule("""
    __global__ void layer1forward(float *x, float *weights, float *preAdd)
    {
        // This does the multiplication in the forward neural net and
        // outputs a pre-addition matrix.
        // Initialize variables. Use blockDim.x instead of a hard-coded 4
        // so the index stays consistent with numweights below.
        int elementIdx = threadIdx.x + blockIdx.x * blockDim.x;
        int neuronIdx = blockIdx.y;
        int numweights = blockDim.x * gridDim.x;
        // do multiply
        preAdd[neuronIdx * numweights + elementIdx] =
            weights[neuronIdx * numweights + elementIdx] * x[elementIdx];
    }
    """)
ForwardAdd = SourceModule("""
    __global__ void layer1forward(float *preAdd, float *s)
    {
        // This adds together the products from ForwardMult.
        // Note: it accumulates into s, so s must already hold a defined
        // starting value (e.g. zero) when the kernel runs.
        int numweights = 20;
        for (int i = 0; i < numweights; i++) {
            s[threadIdx.x] = s[threadIdx.x] + preAdd[numweights * threadIdx.x + i];
        }
    }
    """)
ForwardSigmoid = SourceModule("""
    __global__ void sigmoid(float *s, float *xout)
    {
        // This applies the sigmoid function (algebraically tanh(s)).
        xout[threadIdx.x] = (1 - exp(-2 * s[threadIdx.x])) /
                            (1 + exp(-2 * s[threadIdx.x]));
    }
    """)

# Print stuff
cuda.memcpy_dtoh(x, GPUx)
cuda.memcpy_dtoh(ydes, GPUydes)
cuda.memcpy_dtoh(y, GPUy)
cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
cuda.memcpy_dtoh(weightsL3, GPUweightsL3)

cuda.memcpy_dtoh(L1s, GPUL1s)
cuda.memcpy_dtoh(L2s, GPUL2s)
cuda.memcpy_dtoh(L3s, GPUL3s)
cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
cuda.memcpy_dtoh(L1xout, GPUL1xout)
cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
cuda.memcpy_dtoh(L2xout, GPUL2xout)
cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
cuda.memcpy_dtoh(L3xout, GPUL3xout)
print "x:"
print x
print "y"
print y
print "ydes"
print ydes
print "weightsL1"
print weightsL1
print "L1preadd"
print L1PreAdd
print "L1s"
print L1s
print "L1xout"
print L1xout
print "weightsL2"
print weightsL2
print "L2preadd"
print L2PreAdd
print "L2s"
print L2s
print "L2xout"
print L2xout
print "weightsL3"
print weightsL3
print "L3preadd"
print L3PreAdd
print "L3s"
print L3s
print "L3xout"
print L3xout
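
Once the kernels in the script above are actually launched, a host-side reference could help tell indexing mistakes apart from memory problems. The sketch below is only a reading of the kernel comments, not a confirmed specification: multiply each weight by its input, sum the products per neuron starting from zero, then apply the same sigmoid expression the kernel uses. It is plain Python 3 with no GPU involved, and the function name forward_layer is hypothetical.

```python
import math

def forward_layer(x, weights):
    """Reference forward pass for one layer (hypothetical helper).

    x       : flat list of inputs
    weights : one list of weights per neuron, each len(x) long
    Returns (pre_add, s, xout), mirroring the three kernels.
    """
    # Element-wise products, one row per neuron (the "pre-addition matrix").
    pre_add = [[w * xi for w, xi in zip(row, x)] for row in weights]
    # Per-neuron sums; each sum starts from 0 rather than from old contents.
    s = [sum(row) for row in pre_add]
    # Same expression as the sigmoid kernel.
    xout = [(1 - math.exp(-2 * v)) / (1 + math.exp(-2 * v)) for v in s]
    return pre_add, s, xout

x = [0.01] * 20                              # the 4x5 input, flattened
weights = [[1.0] * 20 for _ in range(4)]     # same shape as weightsL1
pre_add, s, xout = forward_layer(x, weights)
print(s[0], xout[0])
```

With the inputs used in this message, every neuron sum should come out at roughly 0.2, which gives a concrete value to compare against GPUL1s after the kernels run.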

**************** Output with C source **************





_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
