[SYSTEMML-540] Support sparse GPU conv2d as well as fix memory estimation of convolution operations
Design doc: Memory estimation of GPU operators - Since not all operator are supported on GPU, isGPUEnabled indicates whether an operation is enabled for GPU. This method doesnot take into account any memory estimates. - To simplify memory estimation logic, the methods computeOutputMemEstimate and computeIntermediateMemEstimate should return maximum of memory required for GPU and CP operators. - Additionally, these methods are guarded so that when -gpu flag is not provided, additional memory overhead due to GPU are ignored. For example: sparse-to-dense conversion on GPU. - (WIP) Every GPU operators should respect the memory returned by computeIntermediateMemEstimate (and computeOutputMemEstimate - see below point). - (WIP) Every GPU operator should create output in the same format as the corresponding CP operator. That is, computeOutputMemEstimate are consistent across both CP and GPU in terms of worst-case. - The drawback of using maximum memory (mem = Math.max(mem_gpu, mem_gpu)) are: a. GPU operator is not selected when mem_gpu < total memory available on GPU < mem b. CP operator is not selected (i.e. distributed operator compiled) when mem_cpu < driver memory budget < mem Closes #650. Project: http://git-wip-us.apache.org/repos/asf/systemml/repo Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/772d9302 Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/772d9302 Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/772d9302 Branch: refs/heads/master Commit: 772d9302dc196b047134ea491542d55113f52a08 Parents: a0cf8e3 Author: Niketan Pansare <[email protected]> Authored: Thu Sep 7 11:49:52 2017 -0800 Committer: Niketan Pansare <[email protected]> Committed: Thu Sep 7 12:49:52 2017 -0700 ---------------------------------------------------------------------- src/main/cpp/kernels/SystemML.cu | 19 + src/main/cpp/kernels/SystemML.ptx | 2757 +++++++++--------- .../org/apache/sysml/hops/ConvolutionOp.java | 458 ++- src/main/java/org/apache/sysml/hops/Hop.java | 63 +- .../apache/sysml/lops/ConvolutionTransform.java | 11 +- .../cp/ConvolutionCPInstruction.java | 91 +- .../gpu/ConvolutionGPUInstruction.java | 72 +- .../gpu/MatrixBuiltinGPUInstruction.java | 3 +- .../instructions/gpu/context/GPUContext.java | 6 + .../matrix/data/ConvolutionParameters.java | 25 + .../runtime/matrix/data/LibMatrixCUDA.java | 1041 +------ .../runtime/matrix/data/LibMatrixCuDNN.java | 1219 ++++++++ 12 files changed, 3229 insertions(+), 2536 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/systemml/blob/772d9302/src/main/cpp/kernels/SystemML.cu ---------------------------------------------------------------------- diff --git a/src/main/cpp/kernels/SystemML.cu b/src/main/cpp/kernels/SystemML.cu index d64d8aa..bb6482d 100644 --- a/src/main/cpp/kernels/SystemML.cu +++ b/src/main/cpp/kernels/SystemML.cu @@ -156,6 +156,25 @@ __global__ void relu_backward(double* X, double* dout, double* ret, int rlen, i } } +/** + * Performs inplace addition: ret += input + * + * @param input rhs input array allocated on the GPU + * @param ret the input and output array allocated on the GPU + * @param rlen the number of rows + * @param clen the number of columns + */ +extern "C" +__global__ void inplace_add(double* input, double* ret, int rlen, int clen) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int ix = tid / clen; + int iy = tid % clen; + if(ix < rlen && iy < clen) { + int index = ix * clen + iy; + ret[index] += input[index]; + } +} + // Performs the operation corresponding to the DML script: // ones = matrix(1, rows=1, cols=Hout*Wout) // output = input + matrix(bias %*% ones, rows=1, cols=F*Hout*Wout)
