Hi @xidulu. I have not compared the implementations of the host-side and
device-side RNG APIs in MXNet, but if their performance is comparable, a
possibly better approach would be something like this:
- launch only as many blocks and threads as necessary to fill the GPU, each
thread having its own RNG
- use the following pseudocode
```
while (my_sample_id < N_samples) {
  float rng = generate_next_rng();
  bool accepted = ... // compute whether this rng value is accepted
  if (accepted) {
    // write the result
    my_sample_id = next_sample();
  }
}
```
There are two ways to implement `next_sample` here: either with `atomicInc` on
a global counter, or by simply adding the total number of threads to
`my_sample_id` (so every thread produces the same number of samples, give or
take one). The atomic approach is potentially faster, because with static
assignment you could hit a corner case where one thread ends up doing much
more work than the others. However, it is nondeterministic: which thread, and
therefore which RNG stream, fills a given sample depends on scheduling. So I
think static assignment is preferable here.
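
To make the static assignment concrete, here is a minimal CPU-side sketch in C++ that models the loop above. It is not MXNet or CUDA code. The function name `sample_strided`, the `seed + tid` seeding scheme, and the `rng < 0.5f` acceptance test are all placeholders I made up for illustration. Each simulated "thread" owns one RNG and fills slots `tid`, `tid + n_threads`, and so on:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical CPU model of the static (strided) assignment.
// Simulated thread `tid` fills sample slots tid, tid + n_threads, ...
std::vector<float> sample_strided(std::size_t n_samples,
                                  std::size_t n_threads,
                                  std::uint64_t seed) {
    std::vector<float> out(n_samples, -1.0f);
    for (std::size_t tid = 0; tid < n_threads; ++tid) {
        // Per-thread RNG state; seeding with seed + tid is an assumption
        // made for this sketch, not a recommendation for production use.
        std::mt19937 gen(static_cast<std::uint32_t>(seed + tid));
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);

        std::size_t my_sample_id = tid;
        while (my_sample_id < n_samples) {
            float rng = dist(gen);
            bool accepted = rng < 0.5f;  // placeholder acceptance test
            if (accepted) {
                out[my_sample_id] = rng;    // write the result
                my_sample_id += n_threads;  // static next_sample()
            }
        }
    }
    return out;
}
```

Because slot `i` is always produced by thread `i % n_threads` from that thread's own RNG stream, the output depends only on the seed and the thread count, not on execution order. With a shared atomic counter, the slot each thread claims would depend on timing instead.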