Greetings. I am developing a supervised learning neural network simulator with complex-valued weights (MLMVN) and am attempting to parallelize the underlying linear algebra with PyCUDA. So far I've managed to implement all of the required functions using the GPUArray class, the pycuda.cumath module, and the scikits.cuda library without writing a single custom kernel, but there is a significant caveat.

Because the network topology (number of layers, number of neurons per layer) is generated dynamically, the simulator must routinely create a variable number of 2d arrays of varying shapes to hold the weights for each layer. What I need, in effect, is an array of arrays, where each subarray has dimensions specific to the layer it represents. In numpy this would be trivial: create a 1d array with dtype=object and shape=(number of layers,), then assign the 2d weight array for each layer to the corresponding element of the 1d array. This is similar to a Matlab cell array or a C# jagged array, in that the inner dimensions need not all be the same.

Because the pycuda.gpuarray class doesn't support element assignment, a device-side container analogous to that numpy array isn't possible. I tried constructing such a "jagged" numpy array and simply calling gpuarray.to_gpu(numpy_array), but that doesn't work and seems to "confuse" my graphics card. The only workaround I've found is to allocate the 1d numpy object array as before and then iteratively assign a separate GPUArray to each element to hold that layer's weights; in other words, each element of the 1d numpy array is a reference to a GPUArray on the graphics card.

There is significant overhead (roughly one order of magnitude) in accessing each GPUArray compared to accessing a numpy array stored on the host, but I assumed this would be a non-issue since the host code never modifies those GPUArrays; it only passes them to the pycuda.cumath functions and gpuarray operators. That assumption appears to be incorrect: the GPU simulator runs extremely slowly, and its performance degrades further as the learning set grows. On the bright side, it can (and has) converge(d). My conclusion is that the device and host are constantly swapping data during the simulation, and I suspect my method of storing each layer's weights is to blame.
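For concreteness, here is a minimal sketch of the two containers I described; the layer shapes and the complex64 dtype are placeholders I've made up for illustration, since the real topology is generated at run time:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# Hypothetical per-layer weight shapes (neurons, inputs + bias); the real
# topology is generated dynamically by the simulator.
layer_shapes = [(4, 3), (5, 5), (2, 6)]

# Host-side "jagged" container: a 1d object array whose elements are 2d
# complex weight matrices of differing shapes.
host_weights = np.empty(len(layer_shapes), dtype=object)
for i, shape in enumerate(layer_shapes):
    w = np.random.randn(*shape) + 1j * np.random.randn(*shape)
    host_weights[i] = w.astype(np.complex64)

# The workaround described above: the same kind of object array on the host,
# but each element is now a separate GPUArray living on the device.
device_weights = np.empty(len(layer_shapes), dtype=object)
for i in range(len(layer_shapes)):
    device_weights[i] = gpuarray.to_gpu(host_weights[i])

# Each element can then be handed to gpuarray operators and cumath functions,
# e.g. an elementwise scaling that stays entirely on the device:
scaled = 2.0 * device_weights[0]
# ...with an explicit copy back to the host only when a result is needed:
result = scaled.get()

(A plain Python list would presumably serve just as well as the outer container, since only the inner GPUArrays live on the device.)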
So my question is this: does referencing a GPUArray from within a numpy array of objects entail some kind of ungodly overhead, and is there a *good* way to store a "jagged" GPUArray? If anyone is willing to help me through this issue, I would be grateful; source code is available on request. Apologies for the length of this posting and the no doubt plentiful mistakes in it. CH
