Well, that's not helpful. I didn't paste the output with the C source included. Here it is:

x:
[[ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]]
y
[[ 0. 0. NaN 0. 0.]
 [ 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0.]]
ydes
[[ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]]
weightsL1
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
L1preadd
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
L1s
[ 0. 0. Inf Inf]
L1xout
[ 0. 0. 0. 0.]
weightsL2
[[ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]]
L2preadd
[[ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]]
L2s
[ 0. 0. 0. 0.]
L2xout
[ 0. 0. 0. 0.]
weightsL3
[[ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]]
L3preadd
[[ 0. 0. 0. 0.]
 [ 0. Inf 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]]
L3s
[ 0. 0. 0. 0.]
L3xout
[ 0. 0. 0. 0.]
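A possible explanation, offered as an assumption rather than a confirmed diagnosis: `y` and the `L*s`, `L*PreAdd` and `L*xout` arrays in the script are created with `numpy.empty` / `numpy.empty_like`, which return memory holding whatever bytes happened to be there. Those bytes are copied to the GPU and back unchanged. Read as float64 and then cast to float32, tiny leftover values tend to underflow to 0, huge ones overflow to Inf, and NaN bit patterns stay NaN, which would look much like the output above. Since uninitialized contents can differ from run to run, the difference between the runs with and without the C modules need not come from the modules at all. The NumPy-only sketch below (no GPU needed) shows the float32 cast behavior and a zero-initialized allocation; `make_buffer` is a hypothetical helper name.

```python
import numpy


def make_buffer(shape):
    """Hypothetical helper: a float32 buffer that is safe to copy to the GPU.

    numpy.zeros initializes every element, unlike numpy.empty, whose
    contents are whatever bytes were already in that memory.
    """
    return numpy.zeros(shape, dtype=numpy.float32)


# Stand-ins for leftover float64 bit patterns: a huge value, a subnormal
# value, and a NaN. Casting to float32 overflows the first to Inf,
# underflows the second to 0, and keeps the NaN.
garbage = numpy.array([1e300, 1e-310, numpy.nan], dtype=numpy.float64)
with numpy.errstate(over="ignore"):
    as_float32 = garbage.astype(numpy.float32)
print(as_float32)

# Allocating directly as zero-filled float32 avoids both problems.
y = make_buffer([4, 5])
L1s = make_buffer([4])
print(y)
print(L1s)
```

In the script, the same idea would mean replacing each `numpy.empty(...)` / `numpy.empty_like(...)` with `numpy.zeros(...)` / `numpy.zeros_like(...)` before the `memcpy_htod` calls.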
----- Original Message -----
From: "Aaron Greenblatt" <[email protected]>
To: [email protected]
Sent: Tuesday, November 3, 2009 11:05:52 AM GMT -08:00 US/Canada Pacific
Subject: [PyCUDA] PyCuda Memory Question

Hi,

I'm new to Python but have coded in C / CUDA before. I am trying to copy some variables from Python / NumPy to a GPU, and then back to the host again. When I get the data back from the GPU, I appear to get a few seemingly random NaN and Inf values, and I'm confused as to why they are appearing.

I have a few C source modules in the Python script, and when I remove them, some of the Infs go away. This confuses me even more, as I never call the functions in those modules, so removing them shouldn't make a difference. (Or am I missing something there too?)

It almost seems as if the system or video driver is overwriting the memory that I write on the video card. Is this possible and, if so, how does one deal with it in PyCUDA? (I haven't run into this issue when working in C / CUDA before, but my dataset was also pretty small.) I'm going to look through NVIDIA's CUDA Programming Guide again to make sure that I'm not missing something obvious.

Also, I know that I need to optimize the code in the C modules. For now I just want to get something working, and then I'll write C code that uses the hardware better.

I've attached the source code and the output with and without the C source modules. Does anyone have thoughts as to what's going on here?

Thanks for your help!
Aaron

**** Script without C source ***
# Sample source code from the Tutorial Introduction in the documentation.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy

x = numpy.ones([4,5]) * .01
ydes = x
y = numpy.empty_like(x)

L1neurons = 4
L2neurons = 4
L3neurons = 4
L1weightsPerNeuron = x.size
L2weightsPerNeuron = L1neurons
L3weightsPerNeuron = L2neurons

weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])

# numpy.empty / numpy.empty_like (also used for y above) do not initialize memory
L1s = numpy.empty([L1neurons])
L2s = numpy.empty([L2neurons])
L3s = numpy.empty([L3neurons])
L1xout = numpy.empty_like(L1s)
L1PreAdd = numpy.empty_like(weightsL1)
L2xout = numpy.empty_like(L2s)
L2PreAdd = numpy.empty_like(weightsL2)
L3xout = numpy.empty_like(L3s)
L3PreAdd = numpy.empty_like(weightsL3)

# convert these variables to float singles for GPU use
x = x.astype(numpy.float32)
ydes = ydes.astype(numpy.float32)
y = y.astype(numpy.float32)
weightsL1 = weightsL1.astype(numpy.float32)
weightsL2 = weightsL2.astype(numpy.float32)
weightsL3 = weightsL3.astype(numpy.float32)
L1s = L1s.astype(numpy.float32)
L2s = L2s.astype(numpy.float32)
L3s = L3s.astype(numpy.float32)
L1PreAdd = L1PreAdd.astype(numpy.float32)
L1xout = L1xout.astype(numpy.float32)
L2PreAdd = L2PreAdd.astype(numpy.float32)
L2xout = L2xout.astype(numpy.float32)
L3PreAdd = L3PreAdd.astype(numpy.float32)
L3xout = L3xout.astype(numpy.float32)

# allocate GPU memory
GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)

# copy variables to GPU
cuda.memcpy_htod(GPUx, x)
cuda.memcpy_htod(GPUydes, ydes)
cuda.memcpy_htod(GPUy, y)
cuda.memcpy_htod(GPUweightsL1, weightsL1)
cuda.memcpy_htod(GPUweightsL2, weightsL2)
cuda.memcpy_htod(GPUweightsL3, weightsL3)
cuda.memcpy_htod(GPUL1s, L1s)
cuda.memcpy_htod(GPUL2s, L2s)
cuda.memcpy_htod(GPUL3s, L3s)
cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
cuda.memcpy_htod(GPUL1xout, L1xout)
cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
cuda.memcpy_htod(GPUL2xout, L2xout)
cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
cuda.memcpy_htod(GPUL3xout, L3xout)

# copy everything back from the GPU and print it
cuda.memcpy_dtoh(x, GPUx)
cuda.memcpy_dtoh(ydes, GPUydes)
cuda.memcpy_dtoh(y, GPUy)
cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
cuda.memcpy_dtoh(weightsL3, GPUweightsL3)
cuda.memcpy_dtoh(L1s, GPUL1s)
cuda.memcpy_dtoh(L2s, GPUL2s)
cuda.memcpy_dtoh(L3s, GPUL3s)
cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
cuda.memcpy_dtoh(L1xout, GPUL1xout)
cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
cuda.memcpy_dtoh(L2xout, GPUL2xout)
cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
cuda.memcpy_dtoh(L3xout, GPUL3xout)

print("x:")
print(x)
print("y")
print(y)
print("ydes")
print(ydes)
print("weightsL1")
print(weightsL1)
print("L1preadd")
print(L1PreAdd)
print("L1s")
print(L1s)
print("L1xout")
print(L1xout)
print("weightsL2")
print(weightsL2)
print("L2preadd")
print(L2PreAdd)
print("L2s")
print(L2s)
print("L2xout")
print(L2xout)
print("weightsL3")
print(weightsL3)
print("L3preadd")
print(L3PreAdd)
print("L3s")
print(L3s)
print("L3xout")
print(L3xout)

****** Output without C source *****
x:
[[ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]]
y
[[ 0. 0. NaN 0. 0.]
 [ 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0.]]
ydes
[[ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]
 [ 0.01 0.01 0.01 0.01 0.01]]
weightsL1
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
L1preadd
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
L1s
[ 0. 0. 0. 0.]
L1xout
[ 0. 0. 0. 0.]
weightsL2
[[ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]]
L2preadd
[[ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]]
L2s
[ 0. 0. 0. 0.]
L2xout
[ 0. 0. 0. 0.]
weightsL3
[[ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]
 [ 1. 1. 1. 1.]]
L3preadd
[[ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]
 [ 0. 0. 0. 0.]]
L3s
[ 0. 0. 0. 0.]
L3xout
[ 0. 0. 0. 0.]

******* Script with C Source ***************
# Sample source code from the Tutorial Introduction in the documentation.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy

x = numpy.ones([4,5]) * .01
ydes = x
y = numpy.empty_like(x)

L1neurons = 4
L2neurons = 4
L3neurons = 4
L1weightsPerNeuron = x.size
L2weightsPerNeuron = L1neurons
L3weightsPerNeuron = L2neurons

weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])

# numpy.empty / numpy.empty_like (also used for y above) do not initialize memory
L1s = numpy.empty([L1neurons])
L2s = numpy.empty([L2neurons])
L3s = numpy.empty([L3neurons])
L1xout = numpy.empty_like(L1s)
L1PreAdd = numpy.empty_like(weightsL1)
L2xout = numpy.empty_like(L2s)
L2PreAdd = numpy.empty_like(weightsL2)
L3xout = numpy.empty_like(L3s)
L3PreAdd = numpy.empty_like(weightsL3)

# convert these variables to float singles for GPU use
x = x.astype(numpy.float32)
ydes = ydes.astype(numpy.float32)
y = y.astype(numpy.float32)
weightsL1 = weightsL1.astype(numpy.float32)
weightsL2 = weightsL2.astype(numpy.float32)
weightsL3 = weightsL3.astype(numpy.float32)
L1s = L1s.astype(numpy.float32)
L2s = L2s.astype(numpy.float32)
L3s = L3s.astype(numpy.float32)
L1PreAdd = L1PreAdd.astype(numpy.float32)
L1xout = L1xout.astype(numpy.float32)
L2PreAdd = L2PreAdd.astype(numpy.float32)
L2xout = L2xout.astype(numpy.float32)
L3PreAdd = L3PreAdd.astype(numpy.float32)
L3xout = L3xout.astype(numpy.float32)

# allocate GPU memory
GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)

# copy variables to GPU
cuda.memcpy_htod(GPUx, x)
cuda.memcpy_htod(GPUydes, ydes)
cuda.memcpy_htod(GPUy, y)
cuda.memcpy_htod(GPUweightsL1, weightsL1)
cuda.memcpy_htod(GPUweightsL2, weightsL2)
cuda.memcpy_htod(GPUweightsL3, weightsL3)
cuda.memcpy_htod(GPUL1s, L1s)
cuda.memcpy_htod(GPUL2s, L2s)
cuda.memcpy_htod(GPUL3s, L3s)
cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
cuda.memcpy_htod(GPUL1xout, L1xout)
cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
cuda.memcpy_htod(GPUL2xout, L2xout)
cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
cuda.memcpy_htod(GPUL3xout, L3xout)

# C source code for stuff we do on GPU (compiled here, but never launched)
ForwardMult = SourceModule("""
__global__ void layer1forward(float *x, float *weights, float *preAdd)
{
    // this does the multiplication in the forward neural net and outputs a pre-addition matrix

    // initialize variables
    int elementIdx = threadIdx.x + blockIdx.x * blockDim.x;
    int neuronIdx = blockIdx.y;
    int numweights = blockDim.x * gridDim.x;

    // do multiply
    preAdd[neuronIdx*numweights + elementIdx] = weights[neuronIdx*numweights + elementIdx] * x[elementIdx];
}
""")

ForwardAdd = SourceModule("""
__global__ void layer1forward(float *preAdd, float *s)
{
    // this adds together the products from ForwardMult;
    // s is accumulated in place, so it must be zeroed before launch

    // do add
    int numweights = 20;  // = L1weightsPerNeuron (x.size) in the Python code
    for(int i = 0; i < numweights; i++)
    {
        s[threadIdx.x] = s[threadIdx.x] + preAdd[numweights * threadIdx.x + i];
    }
}
""")

ForwardSigmoid = SourceModule("""
__global__ void sigmoid(float *s, float *xout)
{
    // this applies the bipolar sigmoid (1 - exp(-2s)) / (1 + exp(-2s)), i.e. tanh(s)
    float e = expf(-2.0f * s[threadIdx.x]);
    xout[threadIdx.x] = (1.0f - e) / (1.0f + e);
}
""")

# copy everything back from the GPU and print it
cuda.memcpy_dtoh(x, GPUx)
cuda.memcpy_dtoh(ydes, GPUydes)
cuda.memcpy_dtoh(y, GPUy)
cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
cuda.memcpy_dtoh(weightsL3, GPUweightsL3)
cuda.memcpy_dtoh(L1s, GPUL1s)
cuda.memcpy_dtoh(L2s, GPUL2s)
cuda.memcpy_dtoh(L3s, GPUL3s)
cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
cuda.memcpy_dtoh(L1xout, GPUL1xout)
cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
cuda.memcpy_dtoh(L2xout, GPUL2xout)
cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
cuda.memcpy_dtoh(L3xout, GPUL3xout)

print("x:")
print(x)
print("y")
print(y)
print("ydes")
print(ydes)
print("weightsL1")
print(weightsL1)
print("L1preadd")
print(L1PreAdd)
print("L1s")
print(L1s)
print("L1xout")
print(L1xout)
print("weightsL2")
print(weightsL2)
print("L2preadd")
print(L2PreAdd)
print("L2s")
print(L2s)
print("L2xout")
print(L2xout)
print("weightsL3")
print(weightsL3)
print("L3preadd")
print(L3PreAdd)
print("L3s")
print(L3s)
print("L3xout")
print(L3xout)

**************** Output with C source **************
_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
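The three kernels are compiled but never launched in the script. Once they are, a CPU reference can make it easier to tell a kernel bug from a memory problem. Below is a NumPy sketch of what `layer1forward` in ForwardMult, the summing kernel in ForwardAdd, and `sigmoid` appear to compute, assuming a row-major `weightsL1` of shape (4, 20) and `x` flattened to 20 elements as in the script; the function names `forward_mult`, `forward_add` and `forward_sigmoid` are hypothetical. It also checks that the kernel's (1 - exp(-2s)) / (1 + exp(-2s)) is the hyperbolic tangent of s.

```python
import numpy


def forward_mult(x, weights):
    """CPU analogue of layer1forward: preAdd[n, k] = weights[n, k] * x[k]."""
    # Broadcast the flattened input across every neuron's row of weights.
    return weights * x.ravel()[numpy.newaxis, :]


def forward_add(pre_add):
    """CPU analogue of the ForwardAdd kernel: sum each neuron's products.

    This starts from zero; the GPU kernel only does so if s is zeroed first.
    """
    return pre_add.sum(axis=1)


def forward_sigmoid(s):
    """CPU analogue of the sigmoid kernel: (1 - exp(-2s)) / (1 + exp(-2s))."""
    e = numpy.exp(-2.0 * s)
    return (1.0 - e) / (1.0 + e)


x = (numpy.ones([4, 5]) * 0.01).astype(numpy.float32)
weightsL1 = numpy.ones([4, x.size], dtype=numpy.float32)

pre_add = forward_mult(x, weightsL1)   # shape (4, 20)
L1s = forward_add(pre_add)             # each entry is 20 * 0.01
L1xout = forward_sigmoid(L1s)

print(pre_add.shape)
print(L1s)
print(L1xout)
# The formula is the hyperbolic tangent, so numpy.tanh should agree.
print(numpy.allclose(L1xout, numpy.tanh(L1s)))
```

With all-ones weights and x = 0.01, each `L1s` entry should come out near 0.2 and each `L1xout` entry near tanh(0.2) ≈ 0.197; a GPU result far from these, or one containing NaN or Inf, would point either at the kernels or at uninitialized inputs.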
