Repository: incubator-systemml Updated Branches: refs/heads/master ed3a15882 -> 3757995b5
Upgraded to use jcuda8 (from the maven repo) Closes #291 Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/3757995b Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/3757995b Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/3757995b Branch: refs/heads/master Commit: 3757995b50aef019b0ce22d9ae93eae42aed02b4 Parents: ed3a158 Author: Nakul Jindal <naku...@gmail.com> Authored: Fri Mar 3 18:11:45 2017 -0800 Committer: Nakul Jindal <naku...@gmail.com> Committed: Fri Mar 3 18:11:46 2017 -0800 ---------------------------------------------------------------------- docs/devdocs/gpu-backend.md | 61 +++--- pom.xml | 195 +++++++++++++++---- .../runtime/matrix/data/LibMatrixCUDA.java | 19 +- 3 files changed, 195 insertions(+), 80 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/3757995b/docs/devdocs/gpu-backend.md ---------------------------------------------------------------------- diff --git a/docs/devdocs/gpu-backend.md b/docs/devdocs/gpu-backend.md index c6f66d6..40311c7 100644 --- a/docs/devdocs/gpu-backend.md +++ b/docs/devdocs/gpu-backend.md @@ -19,52 +19,43 @@ limitations under the License. # Initial prototype for GPU backend -A GPU backend implements two important abstract classes: +The GPU backend implements two important abstract classes: 1. `org.apache.sysml.runtime.controlprogram.context.GPUContext` 2. `org.apache.sysml.runtime.controlprogram.context.GPUObject` -The GPUContext is responsible for GPU memory management and initialization/destruction of Cuda handles. +The `GPUContext` is responsible for GPU memory management and initialization/destruction of CUDA handles. +Currently, an active instance of the `GPUContext` class is made available globally and is used to store handles +of the allocated blocks on the GPU.
A count is kept per block for the number of instructions that need it. +When the count is 0, the block may be evicted on a call to `GPUObject.evict()`. -A GPUObject (like RDDObject and BroadcastObject) is stored in CacheableData object. It gets call-backs from SystemML's bufferpool on following methods +A `GPUObject` (like RDDObject and BroadcastObject) is stored in a `CacheableData` object. It gets callbacks from SystemML's bufferpool on the following methods: 1. void acquireDeviceRead() -2. void acquireDenseDeviceModify(int numElemsToAllocate) -3. void acquireHostRead() -4. void acquireHostModify() -5. void release(boolean isGPUCopyModified) +2. void acquireDeviceModifyDense() +3. void acquireDeviceModifySparse() +4. void acquireHostRead() +5. void acquireHostModify() +6. void releaseInput() +7. void releaseOutput() -## JCudaContext: -The current prototype supports Nvidia's CUDA libraries using JCuda wrapper. The implementation for the above classes can be found in: -1. `org.apache.sysml.runtime.controlprogram.context.JCudaContext` -2. `org.apache.sysml.runtime.controlprogram.context.JCudaObject` +Sparse matrices on the GPU are represented in `CSR` (compressed sparse row) format. In the SystemML runtime, they are represented in `MCSR` (modified `CSR`) format. +A conversion cost is incurred when sparse matrices are sent back and forth between host and device memory. -### Setup instructions for JCudaContext: +Concrete classes `JCudaContext` and `JCudaObject` (which extend `GPUContext` and `GPUObject` respectively) contain references to `org.jcuda.*`. -1. Follow the instructions from `https://developer.nvidia.com/cuda-downloads` and install CUDA 7.5. -2. Follow the instructions from `https://developer.nvidia.com/cudnn` and install CuDNN v4. -3. Download install JCuda binaries version 0.7.5b and JCudnn version 0.7.5.
Easiest option would be to use mavenized jcuda: -```python -git clone https://github.com/MysterionRise/mavenized-jcuda.git -mvn -Djcuda.version=0.7.5b -Djcudnn.version=0.7.5 clean package -CURR_DIR=`pwd` -JCUDA_PATH=$CURR_DIR"/target/lib/" -JAR_PATH="." -for j in `ls $JCUDA_PATH/*.jar` -do - JAR_PATH=$JAR_PATH":"$j -done -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JCUDA_PATH -``` +The `LibMatrixCUDA` class contains methods to invoke CUDA libraries (where available) and invoke custom kernels. +Runtime classes (that extend `GPUInstruction`) redirect calls to functions in this class. +Some functions in `LibMatrixCUDA` need finer control over GPU memory management primitives. These are provided by `JCudaObject`. + +### Setup instructions: -Note for Windows users: -* CuDNN v4 is available to download: `http://developer.download.nvidia.com/compute/redist/cudnn/v4/cudnn-7.0-win-x64-v4.0-prod.zip` -* If above steps doesn't work for JCuda, copy the DLLs into C:\lib (or /lib) directory. +1. Follow the instructions from `https://developer.nvidia.com/cuda-downloads` and install CUDA 8.0. +2. Follow the instructions from `https://developer.nvidia.com/cudnn` and install CuDNN v5.1. -To use SystemML's GPU backend, +To use SystemML's GPU backend when using the jar or uber-jar: 1. Add JCuda's jar into the classpath. -2. Include CUDA, CuDNN and JCuda's libraries in LD_LIBRARY_PATH (or using -Djava.library.path). -3. Use `-gpu` flag. +2. Use the `-gpu` flag. For example: to use GPU backend in standalone mode: -```python -java -classpath $JAR_PATH:systemml-0.10.0-incubating-SNAPSHOT-standalone.jar org.apache.sysml.api.DMLScript -f MyDML.dml -gpu -exec singlenode ... +```bash +java -classpath $JAR_PATH:systemml-0.14.0-incubating-SNAPSHOT-standalone.jar org.apache.sysml.api.DMLScript -f MyDML.dml -gpu -exec singlenode ...
``` http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/3757995b/pom.xml ---------------------------------------------------------------------- diff --git a/pom.xml b/pom.xml index e044c05..f54b058 100644 --- a/pom.xml +++ b/pom.xml @@ -71,10 +71,12 @@ <scala.test.version>2.2.6</scala.test.version> <maven.build.timestamp.format>yyyy-MM-dd HH:mm:ss z</maven.build.timestamp.format> <enableGPU>false</enableGPU> + <jcuda.scope>provided</jcuda.scope> + <jcuda.version>0.8.0</jcuda.version> <!-- OS-specific JVM arguments for running integration tests --> <integrationTestExtraJVMArgs /> </properties> - + <repositories> <repository> <id>central</id> @@ -83,14 +85,6 @@ <enabled>true</enabled> </releases> </repository> - <repository> - <id>mavenized-jcuda-mvn-repo</id> - <url>https://raw.github.com/niketanpansare/mavenized-jcuda/mvn-repo/</url> - <snapshots> - <enabled>true</enabled> - <updatePolicy>always</updatePolicy> - </snapshots> - </repository> </repositories> <build> @@ -169,6 +163,13 @@ <goals> <goal>shade</goal> </goals> + <configuration> + <artifactSet> + <!--<excludes> + <exclude>org.jcuda:*</exclude> + </excludes>--> + </artifactSet> + </configuration> </execution> </executions> @@ -568,6 +569,60 @@ </build> <profiles> + + <profile> + <id>windows-x86_64</id> + <activation> + <os> + <family>windows</family> + <arch>amd64</arch> + </os> + </activation> + <properties> + <jcuda.os>windows</jcuda.os> + <jcuda.arch>x86_64</jcuda.arch> + </properties> + </profile> + <profile> + <id>linux-x86_64</id> + <activation> + <os> + <family>unix</family> + <arch>amd64</arch> + </os> + </activation> + <properties> + <jcuda.os>linux</jcuda.os> + <jcuda.arch>x86_64</jcuda.arch> + </properties> + </profile> + <profile> + <id>apple-x86_64</id> + <activation> + <os> + <family>mac</family> + <arch>x86_64</arch> + </os> + </activation> + <properties> + <jcuda.os>apple</jcuda.os> + <jcuda.arch>x86_64</jcuda.arch> + </properties> + </profile> + <profile> + <id>linux-ppc_64</id>
+ <activation> + <os> + <family>unix</family> + <arch>ppc64le</arch> + </os> + </activation> + <properties> + <jcuda.os>linux</jcuda.os> + <jcuda.arch>ppc_64</jcuda.arch> + </properties> + </profile> + <profile> <id>scala-2.10</id> <properties> @@ -575,7 +630,7 @@ <scala.binary.version>2.10</scala.binary.version> </properties> </profile> - + <profile> <id>scala-2.11</id> <properties> @@ -811,7 +866,7 @@ </execution> </executions> </plugin> - + <plugin> <artifactId>maven-gpg-plugin</artifactId> <version>1.6</version> @@ -1032,50 +1087,112 @@ <dependencies> - - <!-- For GPU backend - Use org.mystic:mavenized-jcuda until Alan puts org.jcuda:* - --> - <dependency> - <groupId>org.mystic</groupId> - <artifactId>mavenized-jcuda</artifactId> - <version>0.7.5b</version> - <type>jar</type> - <scope>provided</scope> - <exclusions> - <exclusion> - <groupId>*</groupId> - <artifactId>*</artifactId> - </exclusion> - </exclusions> - </dependency> - <!-- Since there is no mvn repo for jcuda + <dependency> <groupId>org.jcuda</groupId> <artifactId>jcuda</artifactId> - <version>0.7.5b</version> - <scope>provided</scope> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> </dependency> <dependency> <groupId>org.jcuda</groupId> <artifactId>jcublas</artifactId> - <version>0.7.5b</version> - <scope>provided</scope> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcufft</artifactId> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> </dependency> <dependency> <groupId>org.jcuda</groupId> <artifactId>jcusparse</artifactId> - <version>0.7.5b</version> - <scope>provided</scope> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcusolver</artifactId> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> +
<groupId>org.jcuda</groupId> + <artifactId>jcurand</artifactId> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jnvgraph</artifactId> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> </dependency> <dependency> <groupId>org.jcuda</groupId> <artifactId>jcudnn</artifactId> - <version>0.7.5</version> - <scope>provided</scope> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcuda-natives</artifactId> + <classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcublas-natives</artifactId> + <classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcufft-natives</artifactId> + <classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcusparse-natives</artifactId> + <classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcusolver-natives</artifactId> + <classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcurand-natives</artifactId> + <classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jnvgraph-natives</artifactId> +
<classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> + </dependency> + <dependency> + <groupId>org.jcuda</groupId> + <artifactId>jcudnn-natives</artifactId> + <classifier>${jcuda.os}-${jcuda.arch}</classifier> + <version>${jcuda.version}</version> + <scope>${jcuda.scope}</scope> </dependency> - --> - <!-- ************************* --> <dependency> <groupId>org.apache.spark</groupId> http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/3757995b/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixCUDA.java ---------------------------------------------------------------------- diff --git a/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixCUDA.java b/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixCUDA.java index 31ec348..51a0f6b 100644 --- a/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixCUDA.java +++ b/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixCUDA.java @@ -25,6 +25,7 @@ import static jcuda.jcudnn.JCudnn.cudnnActivationForward; import static jcuda.jcudnn.JCudnn.cudnnConvolutionBackwardData; import static jcuda.jcudnn.JCudnn.cudnnConvolutionBackwardFilter; import static jcuda.jcudnn.JCudnn.cudnnConvolutionForward; +import static jcuda.jcudnn.JCudnn.cudnnCreateActivationDescriptor; import static jcuda.jcudnn.JCudnn.cudnnCreateConvolutionDescriptor; import static jcuda.jcudnn.JCudnn.cudnnCreateFilterDescriptor; import static jcuda.jcudnn.JCudnn.cudnnCreatePoolingDescriptor; @@ -38,12 +39,14 @@ import static jcuda.jcudnn.JCudnn.cudnnGetConvolutionBackwardFilterWorkspaceSize import static jcuda.jcudnn.JCudnn.cudnnGetConvolutionForwardWorkspaceSize; import static jcuda.jcudnn.JCudnn.cudnnPoolingBackward; import static jcuda.jcudnn.JCudnn.cudnnPoolingForward; +import static jcuda.jcudnn.JCudnn.cudnnSetActivationDescriptor; import static jcuda.jcudnn.JCudnn.cudnnSetConvolution2dDescriptor; import static
jcuda.jcudnn.JCudnn.cudnnSetFilter4dDescriptor; import static jcuda.jcudnn.JCudnn.cudnnSetPooling2dDescriptor; import static jcuda.jcudnn.JCudnn.cudnnSetTensor4dDescriptor; import static jcuda.jcudnn.cudnnConvolutionMode.CUDNN_CROSS_CORRELATION; import static jcuda.jcudnn.cudnnDataType.CUDNN_DATA_DOUBLE; +import static jcuda.jcudnn.cudnnNanPropagation.CUDNN_PROPAGATE_NAN; import static jcuda.jcudnn.cudnnPoolingMode.CUDNN_POOLING_MAX; import static jcuda.jcudnn.cudnnTensorFormat.CUDNN_TENSOR_NCHW; import static jcuda.jcusparse.JCusparse.cusparseDcsrgemm; @@ -75,6 +78,7 @@ import jcuda.jcublas.JCublas2; import jcuda.jcublas.cublasFillMode; import jcuda.jcublas.cublasHandle; import jcuda.jcublas.cublasOperation; +import jcuda.jcudnn.cudnnActivationDescriptor; import jcuda.jcudnn.cudnnConvolutionDescriptor; import jcuda.jcudnn.cudnnConvolutionFwdPreference; import jcuda.jcudnn.cudnnFilterDescriptor; @@ -268,7 +272,7 @@ public class LibMatrixCUDA { private static cudnnFilterDescriptor allocateFilterDescriptor(int K, int C, int R, int S) { cudnnFilterDescriptor filterDesc = new cudnnFilterDescriptor(); cudnnCreateFilterDescriptor(filterDesc); - cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_DOUBLE, K, C, R, S); + cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_DOUBLE, CUDNN_TENSOR_NCHW, K, C, R, S); return filterDesc; } @@ -285,7 +289,7 @@ public class LibMatrixCUDA { private static cudnnPoolingDescriptor allocatePoolingDescriptor(int R, int S, int pad_h, int pad_w, int stride_h, int stride_w) { cudnnPoolingDescriptor poolingDesc = new cudnnPoolingDescriptor(); cudnnCreatePoolingDescriptor(poolingDesc); - cudnnSetPooling2dDescriptor(poolingDesc, CUDNN_POOLING_MAX, CUDNN_PROPAGATE_NAN, R, S, pad_h, pad_w, stride_h, stride_w); return poolingDesc; } @@ -474,10 +478,13 @@ public class LibMatrixCUDA { // Allocate descriptors srcTensorDesc =
allocateTensorDescriptor((int)N, 1, (int)H, (int)W); dstTensorDesc = allocateTensorDescriptor((int)N, 1, (int)H, (int)W); - - cudnnActivationForward(cudnnHandle, CUDNN_ACTIVATION_RELU, - alpha, srcTensorDesc, srcData, - beta, dstTensorDesc, dstData); + cudnnActivationDescriptor activationDescriptor = new cudnnActivationDescriptor(); + cudnnCreateActivationDescriptor(activationDescriptor); + double dummy = -1; + cudnnSetActivationDescriptor(activationDescriptor, CUDNN_ACTIVATION_RELU, CUDNN_PROPAGATE_NAN, dummy); + cudnnActivationForward(cudnnHandle, activationDescriptor, + alpha, srcTensorDesc, srcData, + beta, dstTensorDesc, dstData); } } finally {
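The gpu-backend.md change above describes per-block bookkeeping in `GPUContext`: a count is kept per block for the number of instructions that need it, and a block may be evicted only when the count reaches 0. The following is a toy sketch of that idea only; the class and method names are invented for illustration and are not SystemML's actual `GPUContext`/`GPUObject.evict()` code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy reference-counted block tracker. acquire() bumps the per-block count,
// release() decrements it, and a block is considered evictable once no
// instruction still needs it (count == 0). Names are hypothetical.
class BlockTrackerSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // An instruction declares it needs this block.
    public void acquire(String block) {
        counts.merge(block, 1, Integer::sum);
    }

    // An instruction is done with this block.
    public void release(String block) {
        counts.merge(block, -1, Integer::sum);
    }

    // Eviction is only legal when the count has dropped to 0.
    public boolean evictable(String block) {
        return counts.getOrDefault(block, 0) == 0;
    }
}
```

In the real backend the equivalent counter is consulted by the memory manager before `GPUObject.evict()` frees device memory.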
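The doc change also notes that sparse matrices on the GPU use the `CSR` format while the host runtime uses `MCSR`, so a conversion cost is paid on host-device transfer. As background, a minimal dense-to-CSR conversion looks like the sketch below (toy code with an invented class name, not SystemML's `JCudaObject` conversion):

```java
// Toy CSR (compressed sparse row) builder. rowPtr has nrows+1 entries;
// colInd and values hold one entry per non-zero. This illustrates the
// format only; it is not the SystemML implementation.
class CsrSketch {
    final int[] rowPtr;
    final int[] colInd;
    final double[] values;

    CsrSketch(int[] rowPtr, int[] colInd, double[] values) {
        this.rowPtr = rowPtr;
        this.colInd = colInd;
        this.values = values;
    }

    static CsrSketch fromDense(double[][] dense) {
        int rows = dense.length, cols = dense[0].length;
        int nnz = 0;
        for (double[] row : dense)
            for (double v : row)
                if (v != 0) nnz++;
        int[] rowPtr = new int[rows + 1];
        int[] colInd = new int[nnz];
        double[] values = new double[nnz];
        int k = 0;
        for (int i = 0; i < rows; i++) {
            rowPtr[i] = k; // first non-zero of row i
            for (int j = 0; j < cols; j++) {
                if (dense[i][j] != 0) {
                    colInd[k] = j;
                    values[k] = dense[i][j];
                    k++;
                }
            }
        }
        rowPtr[rows] = k; // total number of non-zeros
        return new CsrSketch(rowPtr, colInd, values);
    }
}
```

MCSR differs in that each row keeps its own resizable arrays, which is convenient for incremental updates on the host but must be flattened into the three contiguous CSR arrays before handing the matrix to cuSPARSE.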
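The pom.xml profiles above select the `${jcuda.os}-${jcuda.arch}` classifier for the jcuda `*-natives` artifacts via Maven's `<activation><os>` matching. That mapping can be summarized as a small pure function (hypothetical helper for illustration; Maven performs this selection itself, no such Java code exists in the project):

```java
// Mirrors the four profiles added to the pom: windows-x86_64, linux-x86_64,
// apple-x86_64 and linux-ppc_64. Class and method names are made up.
class JCudaClassifierSketch {
    static String classifier(String osName, String osArch) {
        String os;
        String lower = osName.toLowerCase();
        if (lower.contains("windows")) {
            os = "windows";          // <family>windows</family>
        } else if (lower.contains("mac")) {
            os = "apple";            // <family>mac</family>
        } else {
            os = "linux";            // <family>unix</family>
        }
        String arch;
        if (osArch.equals("amd64") || osArch.equals("x86_64")) {
            arch = "x86_64";
        } else if (osArch.equals("ppc64le")) {
            arch = "ppc_64";
        } else {
            throw new IllegalArgumentException("unsupported arch: " + osArch);
        }
        return os + "-" + arch;
    }
}
```

Because the natives are resolved by classifier, a build on an unlisted OS/arch combination simply activates no profile and the native dependency cannot be resolved, which is why each supported platform needs its own profile block in the pom.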