Thank you, Nikolai. This is great news!

________________________________
From: Dmitriy Lyubimov <dlie...@gmail.com>
Sent: Monday, March 27, 2017 7:55:20 PM
To: dev@mahout.apache.org
Subject: Re: Native CUDA support
thanks. JCuda sounds good. :)

On Fri, Mar 10, 2017 at 9:06 AM, Nikolai Sakharnykh <nsakharn...@nvidia.com> wrote:

> Hello everyone,
>
> We're actively working on adding native CUDA support to Apache Mahout.
> Currently, GPU acceleration is enabled through ViennaCL
> (http://viennacl.sourceforge.net/). ViennaCL is a linear algebra
> framework that provides multiple backends, including OpenMP, OpenCL,
> and CUDA. However, as we recently discovered, the CUDA backend in
> ViennaCL is composed of manually written CUDA kernels that are not well
> tuned for the latest GPU architectures. Instead, we decided to explore
> a way to leverage the CUDA libraries for linear algebra: cuBLAS (dense
> matrices), cuSPARSE (sparse matrices), and cuSOLVER (dense
> factorizations and sparse solvers). These libraries are highly tuned by
> NVIDIA and provide the best performance for many linear algebra
> primitives on the NVIDIA GPU architecture. Moreover, the libraries
> receive frequent updates with new CUDA toolkit releases: bug fixes, new
> functionality, and optimizations.
>
> We considered two approaches:
>
> 1. Make direct calls to the CUDA runtime and libraries through the
>    JavaCPP bridge
> 2. Use the JCuda package (http://www.jcuda.org/)
>
> JCuda is a thin Java layer on top of the CUDA runtime and already
> provides Java wrappers for all available CUDA libraries, so it makes
> sense to choose this path. JCuda also provides a mechanism to call
> custom CUDA kernels by compiling them into PTX with the NVIDIA NVCC
> compiler and then loading them through CUDA driver API calls in Java
> code. Here is example code that allocates a pointer (cudaMalloc) and
> copies data to the GPU (cudaMemcpy) using JCuda:
>
> // Allocate memory on the device
> Pointer deviceData = new Pointer();
> cudaMalloc(deviceData, memorySize);
>
> // Copy the host data to the device
> cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
>     cudaMemcpyKind.cudaMemcpyHostToDevice);
>
> Alternatively, a pointer can be allocated using cudaMallocManaged and
> then accessed on the CPU or on the GPU without explicit copies by
> leveraging Unified Memory. This enables a simpler data management model
> and, on newer architectures, features like on-demand paging and
> transparent oversubscription of GPU memory.
>
> All CUDA libraries operate directly on the GPU pointers. Here is an
> example of calling a single-precision GEMM with JCuda:
>
> // Allocate memory on the device
> Pointer d_A = new Pointer();
> Pointer d_B = new Pointer();
> Pointer d_C = new Pointer();
> cudaMalloc(d_A, n * n * Sizeof.FLOAT);
> cudaMalloc(d_B, n * n * Sizeof.FLOAT);
> cudaMalloc(d_C, n * n * Sizeof.FLOAT);
>
> // Copy the memory
>
> // Execute sgemm
> cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
>     pAlpha, d_A, n, d_B, n, pBeta, d_C, n);
>
> Most of the existing sparse matrix classes and sparse matrix conversion
> routines in Mahout can generally keep their structure, as the CSR
> format is well supported in both the cuSPARSE and cuSOLVER libraries.
>
> Our plan is to create a proof-of-concept implementation first to
> demonstrate matrix-matrix and/or matrix-vector multiplication using the
> CUDA libraries, then expand functionality by adding more BLAS
> operations and advanced algorithms that exist in cuSOLVER. Stay tuned
> for more updates!
>
> Regards,
> Nikolai.
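For reference, here is a minimal sketch of the PTX-loading mechanism described in the quoted message, following the standard JCuda driver API pattern. The kernel name ("scale"), the file "scale.ptx", and the problem size are made-up placeholders for illustration, not anything from the Mahout code; only the JCudaDriver calls themselves are JCuda API.

import static jcuda.driver.JCudaDriver.*;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;

public class KernelLoadSketch {
    public static void main(String[] args) {
        // Initialize the driver API and create a context on device 0
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load a module from a PTX file produced by, e.g., "nvcc -ptx scale.cu",
        // then look up the kernel function by name
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "scale.ptx");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "scale");

        // Allocate device data and copy from the host
        int n = 1024;
        float[] hostData = new float[n];
        CUdeviceptr deviceData = new CUdeviceptr();
        cuMemAlloc(deviceData, n * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceData, Pointer.to(hostData), n * Sizeof.FLOAT);

        // Kernel parameters: a pointer to each argument of "scale(float*, int)"
        Pointer kernelParams = Pointer.to(
            Pointer.to(deviceData),
            Pointer.to(new int[]{n})
        );

        // Launch with 256 threads per block and enough blocks to cover n
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        cuLaunchKernel(function,
            gridSize, 1, 1,    // grid dimensions
            blockSize, 1, 1,   // block dimensions
            0, null,           // shared memory size, stream
            kernelParams, null);
        cuCtxSynchronize();
    }
}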
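The Unified Memory path mentioned above can also be exercised from JCuda. A small sketch, assuming a JCuda version that exposes cudaMallocManaged and allows host access to managed memory through Pointer.getByteBuffer (available in recent releases as far as I know); the size and the fill loop are illustrative:

import static jcuda.runtime.JCuda.*;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import jcuda.Pointer;
import jcuda.Sizeof;

public class ManagedMemorySketch {
    public static void main(String[] args) {
        int n = 1024;

        // Allocate managed memory, visible to both host and device
        Pointer data = new Pointer();
        cudaMallocManaged(data, n * Sizeof.FLOAT, cudaMemAttachGlobal);

        // Host-side access through a buffer view; no explicit cudaMemcpy
        FloatBuffer hostView = data.getByteBuffer(0, n * Sizeof.FLOAT)
            .order(ByteOrder.nativeOrder()).asFloatBuffer();
        for (int i = 0; i < n; i++) {
            hostView.put(i, i);
        }

        // ... kernel launches or library calls on "data" would go here ...

        // Synchronize before touching the data on the host again
        cudaDeviceSynchronize();
        cudaFree(data);
    }
}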
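The SGEMM snippet in the quoted message elides the cuBLAS handle creation and the host-side alpha/beta pointers. A self-contained sketch of the same call using the JCublas2 API, with placeholder matrix contents, might look like this:

import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.runtime.JCuda.*;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.cublasHandle;
import jcuda.runtime.cudaMemcpyKind;

public class SgemmSketch {
    public static void main(String[] args) {
        int n = 512;
        float[] h_A = new float[n * n];
        float[] h_B = new float[n * n];
        float[] h_C = new float[n * n];

        // Create the cuBLAS handle that the snippet above assumes
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);

        // Allocate device matrices and copy the inputs
        Pointer d_A = new Pointer();
        Pointer d_B = new Pointer();
        Pointer d_C = new Pointer();
        cudaMalloc(d_A, n * n * Sizeof.FLOAT);
        cudaMalloc(d_B, n * n * Sizeof.FLOAT);
        cudaMalloc(d_C, n * n * Sizeof.FLOAT);
        cudaMemcpy(d_A, Pointer.to(h_A), n * n * Sizeof.FLOAT,
            cudaMemcpyKind.cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, Pointer.to(h_B), n * n * Sizeof.FLOAT,
            cudaMemcpyKind.cudaMemcpyHostToDevice);

        // alpha and beta are passed as host pointers by default
        Pointer pAlpha = Pointer.to(new float[]{1.0f});
        Pointer pBeta = Pointer.to(new float[]{0.0f});

        // C = alpha * A * B + beta * C (column-major, no transposes)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            pAlpha, d_A, n, d_B, n, pBeta, d_C, n);

        // Copy the result back and clean up
        cudaMemcpy(Pointer.to(h_C), d_C, n * n * Sizeof.FLOAT,
            cudaMemcpyKind.cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        cublasDestroy(handle);
    }
}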
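And since CSR comes up for the cuSPARSE/cuSOLVER side: a sketch of a single-precision CSR matrix-vector multiply with JCusparse, using the cusparseScsrmv entry point from the CUDA 8 era and a tiny hand-built matrix purely for illustration:

import static jcuda.jcusparse.JCusparse.*;
import static jcuda.jcusparse.cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE;
import static jcuda.runtime.JCuda.*;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcusparse.*;
import jcuda.runtime.cudaMemcpyKind;

public class CsrSpmvSketch {
    public static void main(String[] args) {
        // A 3x3 matrix in CSR form: values, row offsets, column indices
        float[] csrVal = {1f, 2f, 3f, 4f};
        int[] csrRowPtr = {0, 2, 3, 4};
        int[] csrColInd = {0, 2, 1, 2};
        int m = 3, n = 3, nnz = 4;
        float[] x = {1f, 1f, 1f};
        float[] y = new float[m];

        cusparseHandle handle = new cusparseHandle();
        cusparseCreate(handle);

        // Matrix descriptor: general matrix, zero-based indexing
        cusparseMatDescr descr = new cusparseMatDescr();
        cusparseCreateMatDescr(descr);
        cusparseSetMatType(descr, cusparseMatrixType.CUSPARSE_MATRIX_TYPE_GENERAL);
        cusparseSetMatIndexBase(descr, cusparseIndexBase.CUSPARSE_INDEX_BASE_ZERO);

        // Copy the CSR arrays and the vectors to the device
        Pointer dVal = new Pointer(), dRowPtr = new Pointer(), dColInd = new Pointer();
        Pointer dX = new Pointer(), dY = new Pointer();
        cudaMalloc(dVal, nnz * Sizeof.FLOAT);
        cudaMalloc(dRowPtr, (m + 1) * Sizeof.INT);
        cudaMalloc(dColInd, nnz * Sizeof.INT);
        cudaMalloc(dX, n * Sizeof.FLOAT);
        cudaMalloc(dY, m * Sizeof.FLOAT);
        cudaMemcpy(dVal, Pointer.to(csrVal), nnz * Sizeof.FLOAT,
            cudaMemcpyKind.cudaMemcpyHostToDevice);
        cudaMemcpy(dRowPtr, Pointer.to(csrRowPtr), (m + 1) * Sizeof.INT,
            cudaMemcpyKind.cudaMemcpyHostToDevice);
        cudaMemcpy(dColInd, Pointer.to(csrColInd), nnz * Sizeof.INT,
            cudaMemcpyKind.cudaMemcpyHostToDevice);
        cudaMemcpy(dX, Pointer.to(x), n * Sizeof.FLOAT,
            cudaMemcpyKind.cudaMemcpyHostToDevice);

        // y = 1.0 * A * x + 0.0 * y
        Pointer pAlpha = Pointer.to(new float[]{1f});
        Pointer pBeta = Pointer.to(new float[]{0f});
        cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz,
            pAlpha, descr, dVal, dRowPtr, dColInd, dX, pBeta, dY);

        // Copy the result vector back to the host
        cudaMemcpy(Pointer.to(y), dY, m * Sizeof.FLOAT,
            cudaMemcpyKind.cudaMemcpyDeviceToHost);
    }
}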