[GitHub] [incubator-mxnet] ptrendx commented on issue #15928: [RFC] A faster version of Gamma sampling on GPU.

2019-08-17 Thread GitBox

ptrendx commented on issue #15928: [RFC] A faster version of Gamma sampling on 
GPU.
URL: 
https://github.com/apache/incubator-mxnet/issues/15928#issuecomment-522289104
 
 
   @yzhliu No. What MXNet currently does is a scheme where, yes, each thread 
gets statically assigned some number of elements, but it runs a separate while 
loop for each of them. The scheme I proposed has a single while loop that 
processes all elements assigned to a given thread. There is a big difference 
between these approaches, due to the SIMT architecture of the GPU. Basically 
you can treat a group of threads (called a warp, 32 threads on NVIDIA GPUs) as 
lanes of a SIMD vector instruction on the CPU. This means that if 1 thread 
needs to perform some computation, all threads in the warp need to execute the 
same instruction (and possibly discard the result).
   So in MXNet's current implementation, for each output element every warp 
always does as many loop iterations as its slowest thread needs (because no 
thread in the warp can exit the while loop while at least 1 thread is still 
not finished).
   In the proposed implementation there is only 1 while loop and the only 
difference between threads lies inside the `if (accepted)` part, which is cheap 
compared to generating a random number. In this implementation every warp does 
as many loop iterations as its slowest thread needs in total, summed over all 
elements assigned to that thread (a total which is hopefully pretty uniform 
across threads, especially as we are talking RNG and not some crafted input, 
and definitely much better than the previous "for each element take the 
slowest and sum that").
   
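The iteration counts described above can be compared with a small CPU-side model. This is only a sketch, not MXNet code: the names `Attempts`, `per_element_loop_cost`, `single_loop_cost` and `make_attempts` are made up for illustration, and the geometric distribution of attempts with acceptance probability `p` is an assumption standing in for the real Gamma rejection test.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Number of lanes in a warp on NVIDIA GPUs.
constexpr int kWarpSize = 32;

// attempts[lane][e] = number of RNG draws the lane needs until element e is accepted.
using Attempts = std::vector<std::vector<int>>;

// One while loop per element: for every element the whole warp keeps iterating
// until its slowest lane accepts, so the cost is the sum of per-element maxima.
// Assumes at least one lane and equally many elements per lane.
long long per_element_loop_cost(const Attempts& a) {
  long long total = 0;
  const std::size_t n_elems = a[0].size();
  for (std::size_t e = 0; e < n_elems; ++e) {
    int slowest = 0;
    for (const auto& lane : a) slowest = std::max(slowest, lane[e]);
    total += slowest;
  }
  return total;
}

// Single while loop over all elements: the warp iterates until the lane with
// the largest total number of attempts is done, so the cost is the maximum of
// the per-lane sums.
long long single_loop_cost(const Attempts& a) {
  long long worst = 0;
  for (const auto& lane : a) {
    long long sum = 0;
    for (int x : lane) sum += x;
    worst = std::max(worst, sum);
  }
  return worst;
}

// Hypothetical workload: every draw is accepted independently with probability p.
Attempts make_attempts(int elems_per_lane, double p, unsigned seed) {
  std::mt19937 gen(seed);
  std::geometric_distribution<int> failures(p);  // failures before first success
  Attempts a(kWarpSize, std::vector<int>(elems_per_lane));
  for (auto& lane : a)
    for (auto& x : lane) x = failures(gen) + 1;
  return a;
}
```

For two lanes with attempts `{1, 3}` and `{3, 1}`, the per-element scheme costs 3 + 3 = 6 warp iterations while the single loop costs max(4, 4) = 4. In general the maximum of sums can never exceed the sum of maxima, so in this model the single loop is never worse.
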
   @xidulu Which RNG is used for the host-side and the device-side API? The 
cuRAND ones should not really differ much in perf between device-side and 
host-side.
   There are a few advantages to generating the random numbers inside the 
sampling kernel (device-side API):
- you don't need to store and load the random numbers you generated (and in 
the fully optimized case generating random numbers should actually be a pretty 
bandwidth-limited operation)
- you don't need additional storage (besides the RNG state, which you need 
anyway)
- you compute only as many random numbers as you really need


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-mxnet] ptrendx commented on issue #15928: [RFC] A faster version of Gamma sampling on GPU.

2019-08-16 Thread GitBox

ptrendx commented on issue #15928: [RFC] A faster version of Gamma sampling on 
GPU.
URL: 
https://github.com/apache/incubator-mxnet/issues/15928#issuecomment-522055756
 
 
   Hi @xidulu. I did not look at the differences in the implementation of the 
host-side vs device-side API for RNG in MXNet, but if they are comparable in 
terms of performance, a possibly better approach would be something like this:
- launch only as many blocks and threads as necessary to fill the GPU, each 
thread having its own RNG
- use the following pseudocode
   ```
   while (my_sample_id < N_samples) {
     float rng = generate_next_rng();
     bool accepted = ... // compute whether this rng value is accepted
     if (accepted) {
       // write the result
       my_sample_id = next_sample();
     }
   }
   ```
   There are 2 ways of implementing `next_sample` here: either by `atomicInc` 
on some global counter or just by adding the total number of threads (so every 
thread processes the same number of samples). The atomic approach is 
potentially faster (with static assignment you could end up hitting a corner 
case where 1 thread still does a lot more work than the other threads), but it 
is nondeterministic, so I think static assignment is preferable here.
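
The static variant of `next_sample` can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the proposed kernel: `run_thread` and `sample_all` are hypothetical names, each simulated thread gets its own `std::mt19937` seeded with `base_seed + tid` as a stand-in for per-thread GPU RNG state, and the acceptance test `rng < 0.5f` is a placeholder rather than the Gamma rejection test.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// One simulated GPU thread running the single while loop from the pseudocode.
// Static assignment: thread `tid` owns samples tid, tid + n_threads, ...
void run_thread(std::size_t tid, std::size_t n_threads, unsigned base_seed,
                std::vector<float>& out) {
  std::mt19937 gen(base_seed + static_cast<unsigned>(tid));  // per-thread RNG state
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
  std::size_t my_sample_id = tid;
  while (my_sample_id < out.size()) {
    const float rng = uniform(gen);
    const bool accepted = rng < 0.5f;  // placeholder test, NOT the Gamma test
    if (accepted) {
      out[my_sample_id] = rng;     // write the result
      my_sample_id += n_threads;   // static next_sample(): stride by thread count
    }
  }
}

// Runs all simulated threads one after another, in forward or reverse order.
std::vector<float> sample_all(std::size_t n_samples, std::size_t n_threads,
                              unsigned base_seed, bool reverse_order) {
  std::vector<float> out(n_samples, -1.0f);
  for (std::size_t i = 0; i < n_threads; ++i) {
    const std::size_t tid = reverse_order ? n_threads - 1 - i : i;
    run_thread(tid, n_threads, base_seed, out);
  }
  return out;
}
```

Because each sample index is owned by exactly one thread and filled from that thread's own RNG stream, the output does not depend on the order in which threads run. With a shared atomic counter, which thread obtains which index would depend on timing, which is the source of the nondeterminism mentioned above.
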


