On 9/20/18 5:15 PM, Duncan Murdoch wrote:
> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
>> It is difficult to do this in a package, since R does not provide access
>> to the random bits generated by the RNG. Only a float in (0,1) is
>> available via unif_rand().
>
> I believe it is safe to multiply the unif_rand() value by 2^32, and take
> the whole number part as an unsigned 32 bit integer. Depending on the
> RNG in use, that will give at least 25 random bits. (The low order bits
> are the questionable ones. 25 is just a guess, not a guarantee.)
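
The conversion suggested in the quote above can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions: `bits_from_unif` is a hypothetical helper name, and the input stands in for a value returned by unif_rand(), which lies strictly inside (0,1).

```cpp
#include <cstdint>

// Hypothetical helper (not part of R's API): scale a uniform double u from
// the open interval (0,1) by 2^32 and keep the whole-number part.
// Because u < 1, the product is below 2^32 and fits in 32 bits.
std::uint32_t bits_from_unif(double u) {
    const double two32 = 4294967296.0;           // 2^32
    return static_cast<std::uint32_t>(u * two32); // truncates toward zero
}
```

If the generator produced a 32-bit integer k and returned exactly k / 2^32, this recovers k, since both the division and the multiplication by a power of two are exact in double precision. How many of the low-order bits are trustworthy still depends on the RNG in use.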
Right, the RNGs in R produce no more than 32 bits, so the conversion to a
double can be reverted. If we ignore those RNGs that produce fewer than
32 bits for the moment, then the attached file contains a sample
implementation (without long vectors, weighted sampling or hashing). It
uses Rcpp for convenience, but I have tried to keep the C++ to a minimum.
Interesting results:

The results for "simple" sampling are the same.

> set.seed(42)
> sample.int(6, 10, replace = TRUE)
 [1] 6 6 2 5 4 4 5 1 4 5
> sample.int(100, 10)
 [1] 46 72 92 25 45 90 98 11 44 51
> set.seed(42)
> sample_int(6, 10, replace = TRUE)
 [1] 6 6 2 5 4 4 5 1 4 5
> sample_int(100, 10)
 [1] 46 72 92 25 45 90 98 11 44 51

But there is no bias with the alternative method:

> m <- ceiling((2/5)*2^32)
> set.seed(42)
> x <- sample.int(m, 1000000, replace = TRUE)
> table(x %% 2)

     0      1
467768 532232
> set.seed(42)
> y <- sample_int(m, 1000000, replace = TRUE)
> table(y %% 2)

     0      1
500586 499414

The differences are also visible when sampling only a few values from
'm' possible values:

> set.seed(42)
> sample.int(m, 6, replace = TRUE)
[1] 1571624817 1609883303  491583978 1426698159 1102510407  891800051
> set.seed(42)
> sample_int(m, 6, replace = TRUE)
[1]  491583978 1426698159 1102510407  891800051 1265449090  231355453

When sampling from 'm' possible values, performance is noticeably worse,
since we often have to draw a second random number:

> bench::mark(orig = sample.int(m, 1000000, replace = TRUE),
+             new = sample_int(m, 1000000, replace = TRUE),
+             check = FALSE)
# A tibble: 2 x 14
  expression     min    mean  median   max `itr/sec` mem_alloc  n_gc n_itr
  <chr>      <bch:t> <bch:t> <bch:t> <bch>     <dbl> <bch:byt> <dbl> <int>
1 orig        8.15ms  8.67ms  8.43ms  10ms     115.     3.82MB     4    52
2 new        25.21ms 25.58ms 25.45ms  27ms      39.1    3.82MB     2    18
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>

When sampling from fewer values, the difference is much less pronounced:

> bench::mark(orig = sample.int(6, 1000000, replace = TRUE),
+             new = sample_int(6, 1000000, replace = TRUE),
+             check = FALSE)
# A tibble: 2 x 14
  expression     min    mean  median     max `itr/sec` mem_alloc  n_gc n_itr
  <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl> <int>
1 orig        8.14ms  8.44ms  8.29ms  9.58ms     118.     3.82MB     4    54
2 new        11.13ms 11.66ms 11.23ms 12.98ms      85.8    3.82MB     3    39
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
#   time <list>, gc <list>

> Another useful diagnostic is
>
>   plot(density(y[y %% 2 == 0]))
>
> Obviously that should give a more or less uniform density, but for
> values near m, the default sample() gives some nice pretty pictures of
> quite non-uniform densities.

Indeed. Adding/subtracting numbers < 10 to/from 'm' gives "interesting"
curves.

> By the way, there are actually quite a few examples of very large m
> besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)
> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> 3a)*2^32, etc.
>
> So perhaps I'm starting to be convinced that the default sample() should
> be fixed.

I have the impression that Lemire's method gives the same results unless
it is correcting for the bias that exists in the current method. If that
is really the case, then the disruption should be rather minor. The
ability to fall back to the old behavior would still be useful, though.

cheerio
ralf

-- 
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: ralf.stub...@daqana.com

Sitz: Potsdam
Register: AG Potsdam HRB 27966 P
Ust.-IdNr.: DE300072622
Geschäftsführer: Prof. Dr. Dr. Karl-Kuno Kunze
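
For readers without the attachment, the rejection step behind Lemire's method can be sketched in standalone C++. This is a minimal sketch under stated assumptions, not the attached implementation: the name `bounded32` is hypothetical, the generator is any callable returning uniformly distributed 32-bit values (for example `std::mt19937`, not R's RNG), and results are 0-based, so R-style indices would need `+ 1`.

```cpp
#include <cstdint>
#include <random>

// Sketch of Lemire's bounded sampling: map a 32-bit random value x to
// (x * range) >> 32 and reject the few x whose low product bits fall below
// 2^32 mod range, which removes the bias of plain scaling.
// Assumption: rng() returns uniform values in [0, 2^32), and range > 0.
template <typename Rng>
std::uint32_t bounded32(Rng& rng, std::uint32_t range) {
    std::uint64_t m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(rng())) * range;
    std::uint32_t low = static_cast<std::uint32_t>(m);   // low 32 bits of the product
    if (low < range) {                                    // only then can a rejection occur
        const std::uint32_t threshold = (0u - range) % range;  // equals 2^32 mod range
        while (low < threshold) {                         // draw again on rejection
            m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(rng())) * range;
            low = static_cast<std::uint32_t>(m);
        }
    }
    return static_cast<std::uint32_t>(m >> 32);           // uniform in [0, range)
}
```

With range = ceiling((2/5)*2^32), the threshold is about 0.2 * 2^32, so roughly one draw in five is rejected and repeated. For range = 6 the threshold is 4 out of 2^32 values, so a second draw is almost never needed.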
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel