Hello everyone,
We're actively working on adding native CUDA support to Apache Mahout.
Currently, GPU acceleration is enabled through ViennaCL
(http://viennacl.sourceforge.net/), a linear algebra framework that
provides multiple backends: OpenMP, OpenCL and CUDA. However, as we
recently discovered, the CUDA backend in ViennaCL consists of hand-written
CUDA kernels that are not well tuned for the latest GPU architectures. We
therefore decided to explore leveraging the CUDA libraries for linear
algebra instead: cuBLAS (dense matrices), cuSPARSE (sparse matrices) and
cuSOLVER (dense factorizations and sparse solvers). These libraries are
highly tuned by NVIDIA and provide the best performance for many linear
algebra primitives on NVIDIA GPUs. Moreover, they receive frequent updates
with each new CUDA toolkit release: bug fixes, new functionality and
optimizations.
We considered two approaches:
1. Direct calls to the CUDA runtime and libraries through the JavaCPP bridge
2. Calls through the JCuda package (http://www.jcuda.org/)
JCuda is a thin Java layer on top of the CUDA runtime that already provides
Java wrappers for all of the CUDA libraries, so it makes sense to choose
this path. JCuda also provides a mechanism for calling custom CUDA kernels:
they are compiled into PTX with the NVIDIA NVCC compiler and then loaded
through CUDA driver API calls in Java code. Here is example code that
allocates device memory (cudaMalloc) and copies host data to the GPU
(cudaMemcpy) using JCuda:
import jcuda.Pointer;
import jcuda.runtime.cudaMemcpyKind;
import static jcuda.runtime.JCuda.*;

// Allocate memory on the device
Pointer deviceData = new Pointer();
cudaMalloc(deviceData, memorySize);
// Copy the host data to the device
cudaMemcpy(deviceData, Pointer.to(hostData), memorySize,
    cudaMemcpyKind.cudaMemcpyHostToDevice);
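To give a rough idea of the custom-kernel path mentioned above, here is a
minimal sketch of loading and launching a kernel through the JCuda driver
API wrappers. The kernel name "add", the PTX file name and the launch
configuration are illustrative, and deviceData and n are assumed to be set
up as in the snippet above:

import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

// One-time driver API initialization and context creation
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
// Load the PTX produced by "nvcc -ptx kernel.cu -o kernel.ptx"
CUmodule module = new CUmodule();
cuModuleLoad(module, "kernel.ptx");
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "add");
// Kernel arguments are packed as an array of pointers to pointers
Pointer kernelParams = Pointer.to(
    Pointer.to(new int[]{ n }),
    Pointer.to(deviceData));
cuLaunchKernel(function,
    (n + 255) / 256, 1, 1, // grid dimensions
    256, 1, 1,             // block dimensions
    0, null,               // shared memory size, stream
    kernelParams, null);   // kernel and extra parameters
cuCtxSynchronize();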
Alternatively, a pointer can be allocated with cudaMallocManaged and then
accessed on the CPU or on the GPU without explicit copies by leveraging
Unified Memory. This enables a simpler data management model and, on newer
architectures, features like on-demand paging and transparent GPU memory
oversubscription.
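Here is a minimal sketch of that model, assuming a recent JCuda version
(0.8.0 or later) that exposes cudaMallocManaged and host-side buffer access
for managed pointers:

import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Allocate managed memory that is visible to both host and device
Pointer managedData = new Pointer();
cudaMallocManaged(managedData, memorySize, cudaMemAttachGlobal);
// View the same allocation from the host, with no explicit copies
FloatBuffer hostView = managedData.getByteBuffer(0, memorySize)
    .order(ByteOrder.nativeOrder()).asFloatBuffer();
hostView.put(0, 42.0f);
// Synchronize before touching managed data the GPU may still be using
cudaDeviceSynchronize();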
All CUDA libraries operate directly on GPU pointers. Here is an example of
calling a single-precision GEMM with JCuda:
import jcuda.Sizeof;
import jcuda.jcublas.cublasHandle;
import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasOperation.*;

// Create a cuBLAS context
cublasHandle handle = new cublasHandle();
cublasCreate(handle);
// Allocate memory on the device
Pointer d_A = new Pointer();
Pointer d_B = new Pointer();
Pointer d_C = new Pointer();
cudaMalloc(d_A, n * n * Sizeof.FLOAT);
cudaMalloc(d_B, n * n * Sizeof.FLOAT);
cudaMalloc(d_C, n * n * Sizeof.FLOAT);
// Copy the host matrices to the device with cudaMemcpy, as above
// Pass the scalars by reference: alpha = 1, beta = 0
Pointer pAlpha = Pointer.to(new float[]{ 1.0f });
Pointer pBeta = Pointer.to(new float[]{ 0.0f });
// Execute sgemm: C = alpha * A * B + beta * C
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
    pAlpha, d_A, n, d_B, n, pBeta, d_C, n);
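One caveat to keep in mind for the integration work: cuBLAS follows the
column-major storage convention of the original Fortran BLAS, so row-major
data coming from the Java side has to be handled with transpose flags or
swapped operands in these calls.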
Most of the existing sparse matrix classes and sparse matrix conversion
routines in Mahout can keep their current structure, since the CSR format
is well supported by both the cuSPARSE and cuSOLVER libraries.
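To illustrate how the CSR arrays map onto a cuSPARSE call, here is a sketch
of a sparse matrix-vector multiply using JCuda's bindings for the csrmv
entry point. The d_csr* arrays, the dense vectors d_x/d_y and the scalar
pointers are assumed to be allocated and populated on the device as in the
dense example:

import jcuda.jcusparse.*;
import static jcuda.jcusparse.JCusparse.*;
import static jcuda.jcusparse.cusparseOperation.*;

// Create a cuSPARSE context and a default (general, zero-based) descriptor
cusparseHandle handle = new cusparseHandle();
cusparseCreate(handle);
cusparseMatDescr descr = new cusparseMatDescr();
cusparseCreateMatDescr(descr);
// y = alpha * A * x + beta * y for an m x n CSR matrix with nnz non-zeros
cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz,
    pAlpha, descr, d_csrVal, d_csrRowPtr, d_csrColInd,
    d_x, pBeta, d_y);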
Our plan is to first create a proof-of-concept implementation that
demonstrates matrix-matrix and/or matrix-vector multiplication using the
CUDA libraries, and then expand the functionality with more BLAS operations
and the advanced algorithms available in cuSOLVER. Stay tuned for more
updates!
Regards,
Nikolai.