Thanks.

JCuda sounds good. :)

On Fri, Mar 10, 2017 at 9:06 AM, Nikolai Sakharnykh <nsakharn...@nvidia.com>
wrote:

> Hello everyone,
>
> We're actively working on adding native CUDA support to Apache Mahout.
> Currently, GPU acceleration is enabled through ViennaCL (
> http://viennacl.sourceforge.net/). ViennaCL is a linear algebra framework
> that provides multiple backends including OpenMP, OpenCL and CUDA. However,
> as we recently discovered the CUDA backend in ViennaCL is composed of
> manually written CUDA kernels that are not well tuned for the latest GPU
> architectures. Instead, we decided to explore a way to leverage CUDA
> libraries for linear algebra: cuBLAS (dense matrices), cuSPARSE (sparse
> matrices) and cuSOLVER (dense factorizations and sparse solvers). These
> libraries are highly tuned by NVIDIA and provide the best performance for
> many linear algebra primitives on the NVIDIA GPU architecture. Moreover,
> the libraries receive frequent updates with each new CUDA toolkit
> release: bug fixes, new functionality, and optimizations.
>
> We considered two approaches:
>
>   1.  Direct calls to CUDA runtime and libraries through JavaCPP bridge
>   2.  Use JCuda package (http://www.jcuda.org/)
>
> JCuda is a thin Java layer on top of the CUDA runtime and already provides
> Java wrappers for all available CUDA libraries, so it makes sense to choose
> this path. JCuda also provides a mechanism to call custom CUDA kernels by
> compiling them into PTX with the NVIDIA NVCC compiler and then loading
> them through CUDA driver API calls in Java code. Here is example code that
> allocates a pointer (cudaMalloc) and copies data to the GPU (cudaMemcpy)
> using JCuda:
>
> // Allocate memory on the device
> Pointer deviceData = new Pointer();
> cudaMalloc(deviceData, memorySize);
>
> // Copy the host data to the device
> cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
>           cudaMemcpyKind.cudaMemcpyHostToDevice);
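>
> For completeness, copying the result back to the host and releasing the
> device allocation follow the same runtime-API pattern:
>
> // Copy the result back to the host
> cudaMemcpy(Pointer.to(hostData), deviceData, memorySize,
>           cudaMemcpyKind.cudaMemcpyDeviceToHost);
>
> // Release the device memory
> cudaFree(deviceData);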
>
> Alternatively, a pointer can be allocated using cudaMallocManaged and then
> it can be accessed on the CPU or on the GPU without explicit copies by
> leveraging Unified Memory. This enables a simpler data management model
> and, on newer architectures, features like on-demand paging and
> transparent oversubscription of GPU memory.
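>
> A minimal sketch of this managed-memory path could look roughly like the
> following (assuming memorySize bytes of float data; the ByteBuffer view is
> how the managed allocation would be touched from the Java side):
>
> // Allocate managed memory, accessible from both host and device
> Pointer managedData = new Pointer();
> cudaMallocManaged(managedData, memorySize, cudaMemAttachGlobal);
>
> // Fill the data directly from the host, without any cudaMemcpy
> FloatBuffer buffer = managedData.getByteBuffer(0, memorySize)
>     .order(ByteOrder.nativeOrder()).asFloatBuffer();
> buffer.put(0, 1.0f);
>
> // Synchronize before reading on the host again after GPU work
> cudaDeviceSynchronize();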
>
> All CUDA libraries operate directly on GPU pointers. Here is an
> example of calling a single-precision GEMM with JCuda:
>
> // Create a cuBLAS handle
> cublasHandle handle = new cublasHandle();
> cublasCreate(handle);
>
> // Allocate memory on the device
> Pointer d_A = new Pointer();
> Pointer d_B = new Pointer();
> Pointer d_C = new Pointer();
> cudaMalloc(d_A, n * n * Sizeof.FLOAT);
> cudaMalloc(d_B, n * n * Sizeof.FLOAT);
> cudaMalloc(d_C, n * n * Sizeof.FLOAT);
>
> // Copy the host matrices (float[] arrays h_A and h_B) to the device
> cudaMemcpy(d_A, Pointer.to(h_A), n * n * Sizeof.FLOAT,
>           cudaMemcpyKind.cudaMemcpyHostToDevice);
> cudaMemcpy(d_B, Pointer.to(h_B), n * n * Sizeof.FLOAT,
>           cudaMemcpyKind.cudaMemcpyHostToDevice);
>
> // Scalar factors alpha = 1 and beta = 0, i.e. C = A * B
> Pointer pAlpha = Pointer.to(new float[]{ 1.0f });
> Pointer pBeta = Pointer.to(new float[]{ 0.0f });
>
> // Execute sgemm
> cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
>            pAlpha, d_A, n, d_B, n, pBeta, d_C, n);
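>
> One caveat worth noting here: cuBLAS assumes column-major storage, while
> Mahout's in-core dense matrices are row-major. For square matrices the
> usual trick is to swap the two operands, which yields the row-major
> product without any explicit transposition:
>
> // Row-major C = A * B is column-major C^T = B^T * A^T,
> // so the A and B operands are swapped in the cuBLAS call
> cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
>            pAlpha, d_B, n, d_A, n, pBeta, d_C, n);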
>
> Most of the existing sparse matrix classes and sparse matrix conversion
> routines in Mahout can keep their current structure, as the CSR format
> is well supported by both the cuSPARSE and cuSOLVER libraries.
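>
> As an illustration, a sparse matrix-vector product over CSR data might
> look roughly as follows with the JCusparse bindings (a sketch: handle and
> descrA are a cusparseHandle and a default cusparseMatDescr, and d_x, d_y,
> pAlpha and pBeta are assumed to be set up as in the dense example):
>
> // A CSR matrix on the GPU is just three device arrays: row offsets
> // (m + 1 ints), plus column indices and values (nnz entries each)
> Pointer csrRowPtr = new Pointer();
> Pointer csrColInd = new Pointer();
> Pointer csrVal = new Pointer();
> cudaMalloc(csrRowPtr, (m + 1) * Sizeof.INT);
> cudaMalloc(csrColInd, nnz * Sizeof.INT);
> cudaMalloc(csrVal, nnz * Sizeof.FLOAT);
>
> // After copying the CSR arrays to the device, y = alpha*A*x + beta*y
> // is a single library call
> cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
>               m, n, nnz, pAlpha, descrA,
>               csrVal, csrRowPtr, csrColInd, d_x, pBeta, d_y);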
>
> Our plan is to create a proof-of-concept implementation first to
> demonstrate matrix-matrix and/or matrix-vector multiplication using CUDA
> libraries, then expand functionality by adding more BLAS operations and
> advanced algorithms that exist in cuSOLVER. Stay tuned for more updates!
>
> Regards,
> Nikolai.
>
