chris-1187 commented on code in PR #2186:
URL: https://github.com/apache/systemds/pull/2186#discussion_r1941499096

##########
scripts/staging/cuda-counter-based-prng/readme.md:
##########
@@ -0,0 +1,410 @@
+# CUDA counter based PRNG
+
+Currently, random matrix generation is done using Java implementations. Either the Java Random class or the custom
+counter based Philox4x64 implementation is used. This is not efficient for large matrices because first, Java is slow
+and second, the matrix has to be copied from the main memory to the GPUs memory for performing matrix operations there.
+We propose to implement a counter-based PRNG on CUDA to generate random matrices directly on the GPU.
+
+To be consistent with the current counter based PRNG implementation, we will use the Philox4x64 algorithm.
+Unfortunately, the CUDA curand library is not open source, and we failed to replicate the numbers generated by the
+curand library using a Java implementation. We therefore propose to use the random123 library, which is an open-source
+library that implements the Philox4x64 algorithm under BSD-3 license. The random123 library is available
+at https://github.com/DEShawResearch/random123. It is well tested using statistical tests as described in the
+paper [Parallel random numbers: as easy as 1, 2, 3](https://doi.org/10.1145/2063384.2063405).
+
+## How to implement
+
+There are two ways how to integrate cuda kernels into the SystemDS project. The first way is to ship a precompiled
+cuda ptx file with the SystemDS project. This has the drawback that the cuda ptx file has to be compiled for each
+cuda version and each gpu architecture.
+
+The second way is to compile the cuda kernels during runtime. This means, the cuda build tools need to be installed
+on the system where the SystemDS project is running, but the cuda ptx file can be compiled for the specific cuda
+version and gpu architecture.
+
+### Precompiled cuda ptx file
+
+Example cuda kernel:
+
+```c++
+extern "C" __global__ void philox_4_64(ulong* output, uint64_t seed, uint64_t startingCounter, size_t numElements) {
+    // Calculate the thread's unique index
+    uint64_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+    // Ensure the thread index is within bounds
+    if (idx * 4 < numElements) {
+        // Initialize the Philox generator with a unique counter and key
+        r123::Philox4x64 rng;
+        r123::Philox4x64::ctr_type ctr = {{startingCounter + idx, 0, 0, 0}}; // Counter (startingCounter + thread index)
+        r123::Philox4x64::key_type key = {{seed}}; // Key (seed)
+
+        // Generate 4 random integers
+        r123::Philox4x64::ctr_type result = rng(ctr, key);
+
+        for (int i = 0; i < 4; ++i) {
+            size_t outputIdx = idx * 4 + i;
+
+            // Ensure we don't exceed the output array bounds
+            if (outputIdx < numElements) {
+                output[outputIdx] = result[i];
+            }
+        }
+    }
+}
+```
+
+To compile the cuda kernel to a ptx file, you can use the following command:
+
+```bash
+/usr/local/cuda/bin/nvcc kernel.cu -ccbin gcc-8 -lstdc++ -I ./random123/include -o cuda_test.ptx -lm --ptx -std=c++11 --gpu-architecture=sm_70
+```
+
+This will compile the cuda kernel to a ptx file that can be shipped with the SystemDS project.
+
+```ptx
+.version 6.5
+.target sm_70
+.address_size 64
+
+.visible .entry philox_4_64(
+    .param .u64 philox_4_64_param_0,
+    .param .u64 philox_4_64_param_1,
+    .param .u64 philox_4_64_param_2,
+    .param .u64 philox_4_64_param_3
+)
+{
+    ... cuda kernel code ...
+}
+
+```
+To use this ptx file in the SystemDS project, you can use the following code:
+
+```java
+import jcuda.*;

Review Comment:
   resolved.