ptrendx commented on issue #15928: [RFC] A faster version of Gamma sampling on
GPU.
URL:
https://github.com/apache/incubator-mxnet/issues/15928#issuecomment-522289104
@yzhliu No. What MXNet currently does is a scheme where, yes, each thread
is statically assigned some number of elements, but it runs a separate while
loop for each of them. The scheme I proposed has a single while loop that
processes all elements assigned to a given thread. There is a big difference
between these approaches, due to the SIMT architecture of the GPU. Basically,
you can treat a group of threads (called a warp, 32 threads on NVIDIA GPUs) as
the lanes of a SIMD vector instruction on the CPU. This means that if 1 thread
needs to perform some computation, all threads in the warp need to execute the
same instruction (and possibly discard the result).
So in the current MXNet implementation, for each output element every group
of 32 threads always does as many loop iterations as its slowest thread
(because no thread in the warp can exit the while loop while at least 1 thread
is still not finished).
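To put a number on that, here is a small host-side C++ model (not MXNet code
and not a CUDA kernel). It assumes we already know how many rejection-sampling
attempts each lane needs for each of its elements; the `Attempts` table and
the function name are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical input: attempts[lane][k] is the number of attempts (>= 1)
// that lane `lane` needs before its k-th element is accepted.
// Every lane is assumed to own the same number of elements.
using Attempts = std::vector<std::vector<int>>;

// Cost model of the "one while loop per element" scheme: for each element
// slot k the whole warp stays in the loop until its slowest lane is done,
// so the warp pays the per-slot maximum, summed over all slots.
long long per_element_loop_cost(const Attempts& attempts) {
    if (attempts.empty()) return 0;
    long long total = 0;
    const std::size_t elems = attempts[0].size();
    for (std::size_t k = 0; k < elems; ++k) {
        int slowest = 0;
        for (const auto& lane : attempts) {
            slowest = std::max(slowest, lane[k]);
        }
        total += slowest;
    }
    return total;
}
```

For two lanes with attempts {1, 3} and {2, 1}, this gives 2 + 3 = 5 warp
iterations, even though neither lane needs more than 4 attempts in total.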
In the proposed implementation there is only 1 while loop, and the only
difference between threads lies inside the `if (accepted)` part, which is cheap
compared to generating a random number. In this implementation every warp does
as many loop iterations as its slowest thread needs in total, summed over all
of that thread's elements (which is hopefully pretty uniform across threads,
especially as we are talking about RNG and not some crafted input, and
definitely much better than the previous "for each element take the slowest
and sum that").
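A rough host-side C++ model of the single-loop scheme, stepping all lanes of
one warp in lockstep. This is only a sketch of the control flow, not the actual
kernel: the `Attempts` table (how many attempts each lane needs per element)
and the function name are assumptions for illustration, and "drawing a
candidate" is reduced to incrementing a counter.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical input: attempts[lane][k] is the number of attempts (>= 1)
// that lane `lane` needs before its k-th element is accepted.
using Attempts = std::vector<std::vector<int>>;

// Counts how many iterations of the single while loop a warp executes.
long long single_loop_cost(const Attempts& attempts) {
    const std::size_t lanes = attempts.size();
    std::vector<std::size_t> elem(lanes, 0);  // element each lane is working on
    std::vector<int> tries(lanes, 0);         // attempts spent on that element
    long long iterations = 0;
    for (;;) {
        bool any_active = false;
        for (std::size_t l = 0; l < lanes; ++l) {
            if (elem[l] >= attempts[l].size()) continue;  // lane is finished
            any_active = true;
            ++tries[l];  // stands in for drawing one candidate
            const bool accepted = tries[l] >= attempts[l][elem[l]];
            if (accepted) {  // the only part where lanes diverge
                ++elem[l];   // move on to the next element right away
                tries[l] = 0;
            }
        }
        if (!any_active) break;  // the warp leaves the loop together
        ++iterations;
    }
    return iterations;
}
```

For two lanes with attempts {1, 3} and {2, 1}, this gives 4 iterations (the
larger of the per-lane totals 1 + 3 and 2 + 1), whereas taking the slowest
lane per element and summing would give 2 + 3 = 5.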
@xidulu Which RNG is used for the host-side and the device-side API? The
cuRAND ones should not really differ much in perf between device-side and
host-side.
There are a few advantages to generating the random numbers inside the kernel:
- you don't need to store and load the random numbers you generated (and in
the fully optimized case generating random numbers should actually be a pretty
bandwidth-limited operation)
- you don't need additional storage (besides the RNG state, which you need
anyway)
- you compute only as many random numbers as you really need
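
As a host-side analogy for these points, here is a toy C++ rejection sampler
that draws candidates on demand. It uses `std::mt19937` rather than cuRAND, and
the acceptance rule, the `SampleStats` struct and the function name are all
made up for illustration; the point is only that no buffer of pre-generated
numbers exists and that the number of draws is exactly what the sampler needed.

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct SampleStats {
    std::vector<float> out;  // accepted samples
    std::size_t draws;       // how many random numbers were generated
};

// Toy rule: a uniform candidate u in [0, 1) is accepted when u < accept_below.
// Assumes accept_below > 0, otherwise the loop would never terminate.
SampleStats sample_on_demand(std::size_t n, float accept_below, unsigned seed) {
    std::mt19937 gen(seed);  // the only extra state: the generator itself
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    SampleStats stats{std::vector<float>(n), 0};
    std::size_t i = 0;
    while (i < n) {
        const float u = uniform(gen);  // generated and consumed in place
        ++stats.draws;
        if (u < accept_below) {
            stats.out[i++] = u;  // accepted: advance to the next output
        }
    }
    return stats;
}
```

A pre-generated buffer would have to be sized for the worst case up front and
then written to and read back from memory, while here `draws` ends up being
whatever the rejections actually required.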