Re: Picky, but much more efficient arc4random_uniform!
We will not be using your code. End of story.

Luke Small wrote:
> Marc, all you have to say is that you all refuse to provide it.
Re: Picky, but much more efficient arc4random_uniform!
Marc, all you have to say is that you all refuse to provide it. I was asked to at least provide evidence of correctness. I did so, and I'd say I did a stellar job, aside from obtaining some kind of statistical test program. The following post has my test source code attached (and references another post), in case you missed it in this ginormous thread:

https://marc.info/?l=openbsd-tech&m=165306234108572&w=2

Aside from multithreaded processes, if you provide this, folks might use it instead of creating their own incorrect custom random number generators to improve performance for smaller values. That custom deterministic "arc4random()" earlier in the thread from Damien was not very good, by the way. I tested it.

> You're about one step away from getting into my spam rules.
>
> Whatever you say, the way you say it, *repeatedly* when all the people
> I know (me included) tell you to get your facts straight so you don't
> sound like a wacko, makes reading you a total waste of time.

--
-Luke
Re: Picky, but much more efficient arc4random_uniform!
On Sat, May 21, 2022 at 07:47:40AM -0500, Luke Small wrote:
> Perhaps it was rude sending off-list stuff to the list. Your email sounded
> "less than friendly" and more of a professional challenge that you were
> definitely in the works to produce; much like Damien Miller's challenge to
> prove correctness. So, whatever.

You're about one step away from getting into my spam rules.

Whatever you say, the way you say it, *repeatedly* when all the people I know (me included) tell you to get your facts straight so you don't sound like a wacko, makes reading you a total waste of time.
Re: Picky, but much more efficient arc4random_uniform!
Perhaps it was rude sending off-list stuff to the list. Your email sounded "less than friendly" and more of a professional challenge that you were definitely in the works to produce; much like Damien Miller's challenge to prove correctness. So, whatever.

Aside from that unpleasantness: I worked in the __builtin_clz(upperbound - 1) mentioned earlier in the thread, instead of my binary search. It made the Knuth sort simulation run even faster. arc4random_uniform now takes 130% of the time mine takes for a series of random numbers decreasing from 65535 to 2. "__builtin_clz((upperbound - 1) | 1)" is only needed if upperbound can be 1, and that possibility is eliminated by first checking that upperbound is >= 2.

static uint32_t
arc4random_uniform_small_unlocked(uint32_t upper_bound)
{
	static uint64_t rand_holder = 0;
	static size_t rand_bits = 0;
	static size_t upper_bound0 = 2;
	static size_t bitmask = 0x01;
	static size_t bits = 1;

	const size_t ub = upper_bound;
	size_t ret;

	if (ub != upper_bound0) {
		if (ub < 2) {
			/*
			 * reset static cache for if a process needs to fork()
			 * to make it manually fork-safe
			 */
			if (ub == 0) {
				rand_holder = 0;
				rand_bits = 0;
			}
			return 0;
		}

		bits = 32 - __builtin_clz(ub - 1);
		bitmask = ((size_t)1 << bits) - 1;
		upper_bound0 = ub;
	}

	do {
		if (rand_bits < bits) {
			rand_holder |= ((uint64_t)arc4random()) << rand_bits;
			/*
			 * rand_bits will be between 0 and 31 here
			 * so the 0x20 bit will be empty
			 * rand_bits += 32;
			 */
			rand_bits |= 32;
		}
		ret = rand_holder & bitmask;
		rand_holder >>= bits;
		rand_bits -= bits;
	} while (ret >= ub);

	return (uint32_t)ret;
}

> Luke,
>
> It's very bad etiquette to deliberately re-post a private, off-list comment
> to a public mailing list.
> Also, please fix your email client to respect the Mail-Followup-To: header,
> this is another lack of etiquette on your part.

I am either using the gmail app on a phone or gmail.com, so I don't know if I can help you there.
Re: Picky, but much more efficient arc4random_uniform!
On Fri, May 20, 2022 at 09:48:28PM -0500, Luke Small wrote:
> Crystal: You can prove that for random, repetitive, correct, database
> record name generation using small upperbounds, the demonstrated 1/3-1/2
> runtime isn't worth it for an upperbound like 26 - 92 in a business context
> that fights for every last millisecond?
>
> Bring it.
>
> Prove the correctness of whatever algorithm you're using while you're at it.

Luke,

It's very bad etiquette to deliberately re-post a private, off-list comment to a public mailing list. Just remember that you might need my help one day, and if your mail has been sent to /dev/null or otherwise ignored then I will never see it.

It's also very rude to make demands when I've already told you that I have other work of a much higher priority.

IF AND ONLY IF there is sufficient interest from OTHER people, (who can indicate it to me OFF LIST), then I will consider writing up a comprehensive article debunking your ideas and concepts. _Quality_ work takes time, and I've only just finished writing up our response to the claim that our asm optimisation in the rasops bit shifting code wasn't beneficial over the C compiler output, (which of course it was).

Also, please fix your email client to respect the Mail-Followup-To: header; this is another lack of etiquette on your part.
Re: Picky, but much more efficient arc4random_uniform!
Crystal: You can prove that for random, repetitive, correct, database record name generation using small upperbounds, the demonstrated 1/3-1/2 runtime isn't worth it for an upperbound like 26 - 92 in a business context that fights for every last millisecond?

Bring it.

Prove the correctness of whatever algorithm you're using while you're at it.

...unless a lot of those processes are threaded, of course. :/

On Fri, May 20, 2022 at 7:27 PM Crystal Kolipe wrote:
> I've actually written a program to demonstrate how pointless your chacha bit
> saving routine is :). I just haven't posted it yet because I'm too busy with
> other, (more useful), things to bother finishing it off.
>
> Your thread on -tech has already wasted more bits than your random number
> routine would save in a very long time, ha ha ha.

--
-Luke
Re: Picky, but much more efficient arc4random_uniform!
I appreciate your response, Damien. I did do the bare minimum of correctness testing, and it was in the post right after Guenther was congratulated on proving incorrectness:

https://marc.info/?l=openbsd-tech&m=165259528425835&w=2

The post includes software to reproduce the results. I wrote a highly dynamically allocated program that tests at geometrically increasing sample sizes, to show the degree to which the output remains random at various stages. This is an example of some output:

testing arc4random_uniform(5) and arc4random_uniform_small_unlocked(5):
               256X    2048X   16384X  131072X 1048576X
0 original      246     2053    16400   131115  1048306
  mine          263     2042    16312   131258  1046989
1 original      234     2013    16337   130894  1047810
  mine          249     2022    16304   131161  1049511
2 original      236     2061    16367   130802  1047430
  mine          257     2117    16597   130718  1046375
3 original      284     2089    16444   131092  1050332
  mine          266     2058    16379   131190  1049877
4 original      280     2024    16372   131457  1049002
  mine          245     2001    16328   131033  1050128

max-min values:
original         50       76      107      655     2902
mine             21      116      293      540     3753

The program is set up to test values 2 through 50. This will create 224KiB of output; I suspected that you'd prefer that I didn't attach it. Some progress output has been relegated to stderr so that you can easily pipe stdout to a file. I added some getrlimit and setrlimit calls to maximize memory limits if necessary; in the test case that I created for you, this will not be needed. It uses 1276K RAM in the current configuration.

> Just to bring this back to where we came in: even putting thread-safety
> aside (which you absolutely can't): it doesn't matter how much faster
> it is, your replacement function isn't useful until you do the work to
> demonstrate correctness.
>
> You have done literally zero so far, not even the bare minimum of
> testing the output. As a result your first version was demonstrated to
> be completely broken by literally the most basic of possible tests, a
> mere 10 lines of code.
> That you left this to others to do tells me that you fundamentally don't
> understand the environment in which you're trying to operate, and that
> you don't respect the time of the people you're trying to convince.
>
> Please stop wasting our time.
>
> -d

--
-Luke

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

extern char *malloc_options;

/*
cc arc4random_uniform_small.c -O2 -march=native -mtune=native -flto=full \
   -static -p -fno-inline && ./a.out && gprof a.out gmon.out

cc arc4random_uniform_small.c -O2 && ./a.out
*/

/*
 * uint32_t
 * arc4random_uniform(uint32_t upper_bound);
 *
 * for which this function is named, provides a cryptographically secure:
 * uint32_t arc4random() % upper_bound;
 *
 * this function minimizes the waste of randomly generated bits,
 * for small upper bounds.
 *
 * 'bits' is the requested number of random bits and it needs to be
 * within the following parameters:
 *
 * 1 << bits >= upper_bound > (1 << bits) >> 1
 *
 * I make a possibly incorrect presumption that size_t is
 * at the very least a uint32_t, but probably a uint64_t for speed
 */
static uint32_t
arc4random_uniform_small_unlocked(uint32_t upper_bound)
{
	static uint64_t rand_holder = 0;
	static uint64_t rand_bits = 0;
	static uint64_t upper_bound0 = 2;
	static uint64_t bits0 = 1;

	uint64_t bits = 16, i = 1, n = 32, ret, ub = upper_bound;

	if (ub != upper_bound0) {
		if (ub < 2) {
			/*
			 * reset static cache for if a process needs to fork()
			 * to make it manually fork-safe
			 */
			if (ub == 0) {
				rand_holder = 0;
				rand_bits = 0;
			}
			return 0;
		}

		/*
		 * binary search for ideal size random bitstream selection
		 */
		for (;;) {
			if ( ub > ((uint64_t)1 << bits) ) {
				if (n - i == 1) {
					bits = n;
					break;
				}
				i = bits;
				bits = (i + n) / 2;
				continue;
			}
			if ( ub <= ((uint64_t)1 << bits) >> 1 ) {
				if (n - i == 1) {
					bits = n;
					break;
				}
				n = bits;
				bits = (i + n) / 2;
				continue;
			}
			break;
		}
		upper_bound0 = ub;
		bits0 = bits;
	} else
		bits = bits0;

	const uint64_t bitmask = ((uint64_t)1 << bits) - 1;

	do {
		if (rand_bits < bits) {
			rand_holder |= ((uint64_t)arc4random()) << rand_bits;
			/*
			 * rand_bits will be a number between 0 and 31 here
			 * so the 0x20 bit will be empty
			 * rand_bits += 32;
			 */
			rand_bits |= 32;
		}
Re: Picky, but much more efficient arc4random_uniform!
Joerg Sonnenberger wrote:
|Am Wed, May 18, 2022 at 12:49:21PM +0200 schrieb Steffen Nurpmeso:
|> What surprised me was that the Apple code requires more calls, and
|> that today divisions and multiplications still matter. I think it
|> was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
|> one cycle, * was ten cycles, and %,/ was forty cycles. But
|> i think the Core i5 8th Gen that i have requires one cycle for all
|> of them.
|
|I dare you to find any CPU optimized for speed where division and modulo
|are as fast as multiplication. What you have nowadays is a single cycle
|for basic operations like additions and shifts, about three cycles for
|multiplications (on x86 at least) and an order of magnitude more for
|division. That's not really surprising either if you consider that

Thanks. I really had not looked since about 2005. (In fact i never really looked; i think a c't magazine CD once came over with x86 interrupt lists, CPU cycle usage, and a (sketchy) RFC collection.) Hm, it even seems the number of cycles increased for division .. i do not know which numbers i had seen by then; did they include cache work at all? Well. So this was that.

|integer multiplication has a fast circuit implementation whereas
|division is often implemented as cascading conditional subtraction.
|The difference matters enough that turning the remainder operation in the
|ELF hash into essentially (x - x * (1/size) * size) gave a 2% speed up for
|a large application like Firefox ten years ago.

Interesting. Well you live in pkgsrc with all the mess that all those huge software suites introduce, whereas i still "jerk" if i produce software which needs more than EXEC (/ RODATA) and BSS, and "shudder" for all the CTOR and DTOR and TLS (but ok: cool) and all the other stuff that is found in ELF, and handled by the linker. (I have ELF v1.2 and TLS-ABI, and even glanced over them a really long time ago though. But you know, it is so far off! And it makes no fun thinking about that :))

|Joerg

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: Picky, but much more efficient arc4random_uniform!
Am Wed, May 18, 2022 at 12:49:21PM +0200 schrieb Steffen Nurpmeso:
> What surprised me was that the Apple code requires more calls, and
> that today divisions and multiplications still matter. I think it
> was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
> one cycle, * was ten cycles, and %,/ was forty cycles. But
> i think the Core i5 8th Gen that i have requires one cycle for all
> of them.

I dare you to find any CPU optimized for speed where division and modulo are as fast as multiplication. What you have nowadays is a single cycle for basic operations like additions and shifts, about three cycles for multiplications (on x86 at least) and an order of magnitude more for division. That's not really surprising either if you consider that integer multiplication has a fast circuit implementation whereas division is often implemented as cascading conditional subtraction. The difference matters enough that turning the remainder operation in the ELF hash into essentially (x - x * (1/size) * size) gave a 2% speed up for a large application like Firefox ten years ago.

Joerg
Re: Picky, but much more efficient arc4random_uniform!
Steffen Nurpmeso wrote in <20220518104921.sr0ht%stef...@sdaoden.eu>:
 ...
|The web site that has been linked from the man from the country
|that has an even much worse Earth Country Overshoot Day than
|Germany and is almost on par with Australia or even USA (April
|3rd, pooh; never again a Saab! Cuba: Nov 25th, Jamaica Dec 20th)

Finland is March 31st. Man is that dirty.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: Picky, but much more efficient arc4random_uniform!
Philip Guenther wrote:
|On Tue, May 17, 2022 at 1:10 PM Steffen Nurpmeso wrote:
|> Joerg Sonnenberger wrote:
|>|Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
|>|> I made a couple new versions of a new kind of arc4random_uniform-like
|> ...
|>|If your main use case is limiting the amount of cryptography when using
|>|small bounds, there is a much simpler approach to be taken here. For
|>|boundaries below 256, use arc4random_buf to extract one byte if bound is
|>|a power of two, otherwise two. This gives most of the performance ...
|> You can use (really implemented) _buf() if you need a 8-bit or
|> 16-bit etc number.
|>
|> I find that _uniform() often makes no difference to a simple
|> modulo because like the comment in _uniform() says "p > 0.5 (worst
|> case, usually far better)", and usually RNGs sprinkle bits nicely,
|
|What does that statement mean? You seem to be saying "modulo is uniform,
|except when it isn't, which could be almost half the time for some cases,
|but when it's uniform it's uniform, so why bother making it actually
|correct and dependable".
|
|I mean, what does that _mean_??? It's as if I said "my text handling

Well it means this thread was too long.

|program handles all characters uniformly, except those with accents, but
|that's less than 10% of the characters I type, so it handles all characters
|uniformly." WTF, NO!

But it also means that

	// calculates 2**32 % range
	uint32_t t = (-range) % range;
	for (;;) {
		uint32_t r = rng();
		if (r >= t)

where range is a very small number results in a very, very low probability that r >= t is not true. For a 16-bit range, two zero bytes had to occur in the upper 16 bits. And worse for a 64-bit RNG. So this is what i said. It depends on the application.

This gets hypothetic and is anyway beyond my mathematical capabilities. I mean i look at /etc/ssh/moduli, so much on cramping of random numbers.

The web site that has been linked from the man from the country that has an even much worse Earth Country Overshoot Day than Germany and is almost on par with Australia or even USA (April 3rd, pooh; never again a Saab! Cuba: Nov 25th, Jamaica Dec 20th) said the bias for the number 52 is 0.0121%. And what i posted had ~0.008 that rand<0x01FF aka 32-bit high byte is 0, for 32-bit from getrandom(2) as well as mine (in "strong" mode, SipHash on ARC4). You need quite some samples to be able to find a statistical frequency of that order. And it depends on the application whether that many samples exist. And even TCP from RFC 793 / September 1981 has a 32-bit sequence number.

But sure Philip, of course, yes, of course you are right: simply call _uniform() and "milk the shit out of the random range" -- just use it and forget about the problem.

What surprised me was that the Apple code requires more calls, and that today divisions and multiplications still matter. I think it was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was one cycle, * was ten cycles, and %,/ was forty cycles. But i think the Core i5 8th Gen that i have requires one cycle for all of them. (And somewhere i read that there are CPUs where the shift operators are now more expensive than multiplication.) I do not really track any of that since 2005. That is nice: you buy a laptop and suddenly have a NVME SSD that can do 1.5GB/sec. Wow.

|> 0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
|> most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
|> NOT for Linux getrandom(2)), which is a pretty high cut off.
 ...
|Where do these ideas come from, that "0 bytes 'do not occur'"?? If your
|rand generator doesn't provide zero bytes at the expected frequency, you
|know, 1 in 256, then you're using a garbage random number generator.
|Please stop making such suggestions here because THEY ARE NOT TRUE ABOUT
|OPENBSD. Do ya'll not bother to test the claims that you make?
 ...
|	for (;;) {
|		u = arc4random();
|		if ((u & 0xff000000) == 0 ||
|		    (u & 0x00ff0000) == 0 ||
|		    (u & 0x0000ff00) == 0 ||
|		    (u & 0x000000ff) == 0)
|			break;

That is any-byte-0.

|00b82e5c58
|ab47880036

Seems random to me. :)

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: Picky, but much more efficient arc4random_uniform!
On Wed, May 18, 2022 at 05:08:02PM +1000, Damien Miller wrote:
> On Wed, 18 May 2022, Otto Moerbeek wrote:
>
> > instrumenting the code to count the number of arc4random calls I see this:
> >
> > openbsd; elapsed = 2.835819; calls = 12340949
> > bitmask; elapsed = 4.335576; calls = 17836216
> > bitmask+reuse; elapsed = 3.710277; calls = 15245337
> >
> > (this is a different number of trials on a slow machine).
> >
> > So the promise of less calls is not fulfilled. Sounds like a bug.
>
> Well, I don't know whether the simple bitmasking approach will really
> result in fewer calls. I *think* our current approach has a higher
> probability per loop of succeeding (at least for small upper_bounds) than
> mask-and-test, but my brain is too addled with a cold to calculate it :/

I *assumed* the speedups seen in the web page were because of fewer rnd calls. That was a mistake.

> What Theo said is 100% right - the cost is dominated by that of the
> underlying RNG. If anyone wants a faster arc4random_uniform() then the
> first place to look is at arc4random().

One more reason to keep the current implementation of arc4random_uniform().

> It's entirely possible that the speedups measured in that article
> are because they are using an omgoptimised (non-crypto) RNG and the
> improvements they saw were due solely to reduction of the overhead inside
> the uniformity code even if it came at the cost of more RNG queries.

Right.

> Anyway, here's a tweaked up version of the test harness that fakes out
> arc4random() with a deterministic RNG that counts how often it is called
> in case anyone wants to play with it further.
	-Otto

> [quoted test harness trimmed]
Re: Picky, but much more efficient arc4random_uniform!
On Wed, 18 May 2022, Otto Moerbeek wrote:

> instrumenting the code to count the number of arc4random calls I see this:
>
> openbsd; elapsed = 2.835819; calls = 12340949
> bitmask; elapsed = 4.335576; calls = 17836216
> bitmask+reuse; elapsed = 3.710277; calls = 15245337
>
> (this is a different number of trials on a slow machine).
>
> So the promise of less calls is not fulfilled. Sounds like a bug.

Well, I don't know whether the simple bitmasking approach will really result in fewer calls. I *think* our current approach has a higher probability per loop of succeeding (at least for small upper_bounds) than mask-and-test, but my brain is too addled with a cold to calculate it :/

What Theo said is 100% right - the cost is dominated by that of the underlying RNG. If anyone wants a faster arc4random_uniform() then the first place to look is at arc4random().

It's entirely possible that the speedups measured in that article are because they are using an omgoptimised (non-crypto) RNG and the improvements they saw were due solely to reduction of the overhead inside the uniformity code, even if it came at the cost of more RNG queries.

Anyway, here's a tweaked up version of the test harness that fakes out arc4random() with a deterministic RNG that counts how often it is called, in case anyone wants to play with it further.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint32_t nqueries;

/* deterministic, query-counting arc4random() replacement for benchmarking */
static uint32_t
fake_random(void)
{
	static uint32_t ready, remain, msb;
	uint32_t ret;

	if (!ready) {
		srand_deterministic(31337);
		ready = 1;
	}
	if (remain == 0) {
		msb = (uint32_t)rand();
		remain = 31; /* XXX makes assumption re RAND_MAX */
	}
	ret = (uint32_t)rand() | (msb << 31);
	msb >>= 1;
	remain--;
	nqueries++;
	return ret;
}

#define arc4random() fake_random()

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif
#if __has_builtin(__builtin_clz)
#define arc4random_clz(x) __builtin_clz(x)
#else
#warning lacks __builtin_clz()
/* Count most-significant zero bits, like __builtin_clz() */
static int
arc4random_clz(unsigned int x)
{
	int ret = 0;
	unsigned int n;
	const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

	while (x != 0) {
		n = (x >> 28) & 0xf;
		if (n != 0)
			return ret + lut[n];
		x <<= 4;
		ret += 4;
	}
	return 32;
}
#endif

/* Current OpenBSD implementation */
static uint32_t
arc4random_uniform1(uint32_t upper_bound)
{
	uint32_t r, min;

	if (upper_bound < 2)
		return 0;

	/* 2**32 % x == (2**32 - x) % x */
	min = -upper_bound % upper_bound;

	/*
	 * This could theoretically loop forever but each retry has
	 * p > 0.5 (worst case, usually far better) of selecting a
	 * number inside the range we need, so it should rarely need
	 * to re-roll.
	 */
	for (;;) {
		r = arc4random();
		if (r >= min)
			break;
	}

	return r % upper_bound;
}

/*
 * Like "Bitmask with Rejection" implementation from
 * https://www.pcg-random.org/posts/bounded-rands.html
 */
static uint32_t
arc4random_uniform2(uint32_t upper_bound)
{
	uint32_t mask, r;

	if (upper_bound < 2)
		return 0;

	mask = 0xffffffff >> arc4random_clz((upper_bound - 1) | 1);
	for (;;) {
		r = arc4random() & mask;
		if (r < upper_bound)
			return r;
	}
	/* NOTREACHED */
}

/*
 * Like Apple's
 * https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
 */
static uint32_t
arc4random_uniform3(uint32_t upper_bound)
{
	int zeroes, bits, remain;
	uint32_t mask, r, rm;

	if (upper_bound < 2)
		return 0;

	zeroes = arc4random_clz((upper_bound - 1) | 1);
	bits = 32 - zeroes;
	mask = (uint32_t)-1 >> zeroes;

	for (;;) {
		r = arc4random();
		rm = r & mask;
		if (rm < upper_bound)
			return rm;
		for (remain = zeroes; remain >= bits; remain -= bits) {
			r >>= bits;
			rm = r & mask;
			if (rm < upper_bound)
				return rm;
		}
	}
	/* NOTREACHED */
}

#include <sys/time.h>

static uint32_t
fixture(const char *s, uint32_t (* const rnd)(uint32_t))
{
	const uint32_t trials = 20 * 1000 * 1000;
	const uint32_t max = 0x8000;
	const uint32_t mul = 7;
	unsigned int v, i, r;
	struct timeval start, finish, delta;

	nqueries = 0;
	gettimeofday(&start,
Re: Picky, but much more efficient arc4random_uniform!
On Wed, May 18, 2022 at 02:50:28PM +1000, Damien Miller wrote: > On Tue, 17 May 2022, Raimo Niskanen wrote: > > > Why reinvent the wheel? > > > > Here is a pretty good walkthrough of established methods: > > > > https://www.pcg-random.org/posts/bounded-rands.html > > > > It sounds to me as if your suggested methor essentially is > > "Bitmask with Rejection -- Apple's Method" with the added twist > > to keep the unused bits of the base generator's word and append > > new, which might be an optimization for small ranges, but might > > be just overhead for large ones. > > > > In that post one can see that there might be other suggested smaller > > optimizations that can be applied to the OpenBSD method, and that > > the bitmask method is effective in many cases but not a clear winner. > > Oh nice, I wasn't aware that Apple had an improved algorithm. I always > thought the best avenue for a speedup was to consider only the bits that > could satisfy the constraint, but it never occurred to me how to actually > make use of this :) > > The toy benchmark below compares the existing implementation with > reimplementations of both their version as well as something more close > to Apple's actual method (which reuses the high bits). > > However, I don't see a speedup for either of the alternate approaches. > > [djm@djm ~]$ cc -O2 -Werror -Wall -Wextra -o rb rb.c && doas nice -n -20 ./rb > openbsd; elapsed = 8.327954 > bitmask; elapsed = 13.306228 > bitmask+reuse; elapsed = 11.567389 instrumenting the code to count the number of arc4random calls I see thsi: openbsd; elapsed = 2.835819; calls = 12340949 bitmask; elapsed = 4.335576; calls = 17836216 bitmask+reuse; elapsed = 3.710277; calls = 15245337 (this is a different number of trials on a slow machine). So the promise of less calls is not fulfilled. Sounds like a bug. -Otto
Re: Picky, but much more efficient arc4random_uniform!
> However, I don't see a speedup for either of the alternate approaches.

Or maybe, just maybe, the underlying function (arc4random, which in the dominant case is just a small chacha step) is so inexpensive relative to the proposed higher-level logic changes? So this is all tilting at windmills, and also something about measuring before optimizing^Wbreaking...
Re: Picky, but much more efficient arc4random_uniform!
On Tue, 17 May 2022, Raimo Niskanen wrote:

> Why reinvent the wheel?
>
> Here is a pretty good walkthrough of established methods:
>
> https://www.pcg-random.org/posts/bounded-rands.html
>
> It sounds to me as if your suggested method essentially is
> "Bitmask with Rejection -- Apple's Method" with the added twist
> to keep the unused bits of the base generator's word and append
> new, which might be an optimization for small ranges, but might
> be just overhead for large ones.
>
> In that post one can see that there might be other suggested smaller
> optimizations that can be applied to the OpenBSD method, and that
> the bitmask method is effective in many cases but not a clear winner.

Oh nice, I wasn't aware that Apple had an improved algorithm. I always thought the best avenue for a speedup was to consider only the bits that could satisfy the constraint, but it never occurred to me how to actually make use of this :)

The toy benchmark below compares the existing implementation with reimplementations of both their version as well as something more close to Apple's actual method (which reuses the high bits).

However, I don't see a speedup for either of the alternate approaches.

[djm@djm ~]$ cc -O2 -Werror -Wall -Wextra -o rb rb.c && doas nice -n -20 ./rb
openbsd; elapsed = 8.327954
bitmask; elapsed = 13.306228
bitmask+reuse; elapsed = 11.567389

Maybe my benchmark is crap or maybe I need dlg@ to omgoptimize it for me...
#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_clz)
#define arc4random_clz(x) __builtin_clz(x)
#else
#warning lacks __builtin_clz()
/* Count most-significant zero bits, like __builtin_clz() */
static int
arc4random_clz(unsigned int x)
{
	int ret = 0;
	unsigned int n;
	const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

	while (x != 0) {
		n = (x >> 28) & 0xf;
		if (n != 0)
			return ret + lut[n];
		x <<= 4;
		ret += 4;
	}
	return 32;
}
#endif

/* Current OpenBSD implementation */
static uint32_t
arc4random_uniform1(uint32_t upper_bound)
{
	uint32_t r, min;

	if (upper_bound < 2)
		return 0;

	/* 2**32 % x == (2**32 - x) % x */
	min = -upper_bound % upper_bound;

	/*
	 * This could theoretically loop forever but each retry has
	 * p > 0.5 (worst case, usually far better) of selecting a
	 * number inside the range we need, so it should rarely need
	 * to re-roll.
	 */
	for (;;) {
		r = arc4random();
		if (r >= min)
			break;
	}

	return r % upper_bound;
}

/*
 * Like "Bitmask with Rejection" implementation from
 * https://www.pcg-random.org/posts/bounded-rands.html
 */
static uint32_t
arc4random_uniform2(uint32_t upper_bound)
{
	uint32_t mask, r;

	if (upper_bound < 2)
		return 0;

	mask = (uint32_t)-1 >> arc4random_clz((upper_bound - 1) | 1);
	for (;;) {
		r = arc4random();
		if ((r & mask) < upper_bound)
			return r & mask;
	}
	/* NOTREACHED */
}

/*
 * Like Apple's
 * https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
 */
static uint32_t
arc4random_uniform3(uint32_t upper_bound)
{
	int zeroes, bits, remain = 32;
	uint32_t mask, r;

	if (upper_bound < 2)
		return 0;

	zeroes = arc4random_clz((upper_bound - 1) | 1);
	bits = 32 - zeroes;
	mask = (uint32_t)-1 >> zeroes;
	for (;;) {
		r = arc4random();
		if ((r & mask) < upper_bound)
			return r & mask;
		for (remain = zeroes; remain >= bits; remain -= bits) {
			r >>= bits;
			if ((r & mask) < upper_bound)
				return r & mask;
		}
	}
	/* NOTREACHED */
}

#include <sys/time.h>

static uint32_t
fixture(const char *s, uint32_t (* const rnd)(uint32_t))
{
	const uint32_t trials = 50 * 1000 * 1000;
	const uint32_t max = 0x8000;
	const uint32_t mul = 7;
	unsigned int v, i, r;
	struct timeval start, finish, delta;

	gettimeofday(&start, NULL);
	for (v = 3, r = 0; v < max; v *= mul) {
		/* printf("%u\n", v); */
		for (i = 0; i < trials; i++)
			r |= rnd(v);
	}
	gettimeofday(&finish, NULL);
	timersub(&finish, &start, &delta);
	printf("%s; elapsed = %lld.%06lld\n", s,
	    (long long)delta.tv_sec, (long long)delta.tv_usec);
	return r;
}

int
main(void)
{
	fixture("openbsd", arc4random_uniform1);
	fixture("bitmask", arc4random_uniform2);
	fixture("bitmask+reuse", arc4random_uniform3);
}
Re: Picky, but much more efficient arc4random_uniform!
On Tue, May 17, 2022 at 1:10 PM Steffen Nurpmeso wrote:

> Joerg Sonnenberger wrote in
> :
> |Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
> |> I made a couple new versions of a new kind of arc4random_uniform-like
> ...
> |If your main use case is limiting the amount of cryptography when using
> |small bounds, there is a much simpler approach to be taken here. For
> |boundaries below 256, use arc4random_buf to extract one byte if bound is
> |a power of two, otherwise two. This gives most of the performance
> |benefit without complicating the algorithm. Extracting two bytes ensures
> |that the probability of success is > 99% and the double extracting
> |doesn't eat up the benefits.
>
> You can use (really implemented) _buf() if you need a 8-bit or
> 16-bit etc number.
>
> I find that _uniform() often makes no difference to a simple
> modulo because like the comment in _uniform() says "p > 0.5 (worst
> case, usually far better", and usually RNGs sprinkle bits nicely,

What does that statement mean? You seem to be saying "modulo is uniform, except when it isn't, which could be almost half the time for some cases, but when it's uniform it's uniform, so why bother making it actually correct and dependable". I mean, what does that _mean_??? It's as if I said "my text handling program handles all characters uniformly, except those with accents, but that's less than 10% of the characters I type, so it handles all characters uniformly." WTF, NO!

> 0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
> most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
> NOT for Linux getrandom(2)), which is a pretty high cut off.
> Using _uniform() just because of its name seems strange thus.

Where do these ideas come from, that "0 bytes 'do not occur'"?? If your rand generator doesn't provide zero bytes at the expected frequency, you know, 1 in 256, then you're using a garbage random number generator.
Please stop making such suggestions here because THEY ARE NOT TRUE ABOUT OPENBSD.

Do ya'll not bother to test the claims that you make?

: bleys; cat f.c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t u;
	long count;

	count = 0;
	while ((u = arc4random()) > 0x1ff)
		count++;
	printf("%08x\t%ld\n", u, count);

	count = 0;
	for (;;) {
		u = arc4random();
		if ((u & 0xff000000) == 0 || (u & 0x00ff0000) == 0 ||
		    (u & 0x0000ff00) == 0 || (u & 0x000000ff) == 0)
			break;
		count++;
	}
	printf("%08x\t%ld\n", u, count);
	return 0;
}
: bleys; cc f.c
: bleys; ./a.out
00b82e5c58
ab47880036
: bleys;

Philip Guenther
Re: Picky, but much more efficient arc4random_uniform!
Steffen Nurpmeso wrote in <20220517220924.xohqc%stef...@sdaoden.eu>:
|Joerg Sonnenberger wrote in
| :
||Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
||> I made a couple new versions of a new kind of arc4random_uniform-like
 ...
|0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
|most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
|NOT for Linux getrandom(2)), which is a pretty high cut off.
 ...

/*@ lotto.cxx - output six random numbers */

#include
#include
#include

//#define NYDPROF_ENABLE
//#define NYD_ENABLE
//#define NYD2_ENABLE
#include

NSPC_USE(su)

static u32
bounded_rand(u32 range, u32 rv){
   for(u32 t = -range % range;;){
      if(rv >= t)
         return rv % range;
      log::write(log::emerg, "NOFIT");
   }
}

int
main(void){
   u32 uni[50], mod[50], rv;

   state::create_core(NIL, (state::debug | log::debug), state::err_nopass);

   su_mem_set(uni, 0, sizeof uni);
   su_mem_set(mod, 0, sizeof mod);

   u32 au=0;
   for(u32 i = 0; i < 10; ++i){
      random::builtin_generate(&rv, sizeof rv, state::err_nopass);
      //if(getrandom(&rv, sizeof rv, GRND_NONBLOCK) == -1)
      //   log::write(log::emerg, "getrandom(2)");
      if(rv < 0x01FF)
         log::write(log::info, "AU %u - 0x%X", ++au, rv);
      ++mod[rv % NELEM(mod)];
      ++uni[bounded_rand(NELEM(mod), rv)];
   }

   u32 unilo, modlo, unihi, modhi;
   unilo = modlo = max::u32;
   unihi = modhi = 0;
   for(u32 i = 0; i < NELEM(uni); ++i){
      unilo = get_min(unilo, uni[i]);
      modlo = get_min(modlo, mod[i]);
      unihi = get_max(unihi, uni[i]);
      modhi = get_max(modhi, mod[i]);
      log::write(log::info, "%u - %u / %u %s",
         i, uni[i], mod[i], (uni[i] != mod[i] ? "!!!" : "="));
   }
   log::write(log::info, "MIN %u / %u, MAX %u / %u - au %u",
      unilo, modlo, unihi, modhi, au);

   return 0;
}

#include // s-it-mode

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: Picky, but much more efficient arc4random_uniform!
Joerg Sonnenberger wrote in
 :
|Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
|> I made a couple new versions of a new kind of arc4random_uniform-like
 ...
|If your main use case is limiting the amount of cryptography when using
|small bounds, there is a much simpler approach to be taken here. For
|boundaries below 256, use arc4random_buf to extract one byte if bound is
|a power of two, otherwise two. This gives most of the performance
|benefit without complicating the algorithm. Extracting two bytes ensures
|that the probability of success is > 99% and the double extracting
|doesn't eat up the benefits.

You can use (really implemented) _buf() if you need a 8-bit or
16-bit etc number.

I find that _uniform() often makes no difference to a simple
modulo because like the comment in _uniform() says "p > 0.5 (worst
case, usually far better", and usually RNGs sprinkle bits nicely,
0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
NOT for Linux getrandom(2)), which is a pretty high cut off.
Using _uniform() just because of its name seems strange thus.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: Picky, but much more efficient arc4random_uniform!
Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
> I made a couple new versions of a new kind of arc4random_uniform-like
> function and some example functions which use them. Instead of having a
> sufficiently large random number greater than the modulus, I pick a random
> number using arc4random() from a bitfield where the length of the bitfield
> is just below or slightly beyond the value of the modulus and returns the
> bitfield if it is less than the value of the modulus.

If your main use case is limiting the amount of cryptography when using
small bounds, there is a much simpler approach to be taken here. For
boundaries below 256, use arc4random_buf to extract one byte if bound is
a power of two, otherwise two. This gives most of the performance
benefit without complicating the algorithm. Extracting two bytes ensures
that the probability of success is > 99% and the double extracting
doesn't eat up the benefits.

Joerg
Re: Picky, but much more efficient arc4random_uniform!
On Mon, 16 May 2022, Luke Small wrote: > Yeah, I see your point. > > I suppose it depends on how conservative you want to be and whether > you want to supply options to people like getchar_unlocked when it > isn’t essential. > > It could be made manually fork-safe if I could make a simple feature > where “arc4random_uniform_unlocked(0);” with a 0 upperbound could > trigger a reset of the static variables rand_bits and rand_holder > which would be simple enough and could be added to a man page. I > certainly read the man pages. (I’m surprised “man getchar” doesn’t > “see also” to getchar_unlocked() by the way.) > > If you look at profiling with programs that call it a lot, > arc4random() inside arc4random_uniform() calls the expensive rekey > function which makes it take more time. That’s why I can get around > 2X-3X performance on a typical repetitive small upperbound loop and > an extra 20% improvement on a single 65536 Knuth shuffle, loop even > though my function repeats a binary search for ‘bit’ size every single > time and has misses which demands calling up more data. > > Otherwise my function would be particularly useful when there’s a loop > with small upperbound like: 26+26+10 which, if I recall correctly, is > in identd, which would call it frequently. Just to bring this back to where we came in: even putting thread-safety aside (which you absolutely can't): it doesn't matter how much faster it is, your replacement function isn't useful until you do the work to demonstrate correctness. You have done literally zero so far, not even the bare minimum of testing the output. As a result your first version was demonstrated to be completely broken by literally the most basic of possible tests, a mere 10 lines of code. That you left this to others to do tells me that you fundamentally don't understand the environment in which you're trying to operate, and that you don't respect the time of the people you're trying to convince. Please stop wasting our time. -d
Re: Picky, but much more efficient arc4random_uniform!
Yeah, I see your point.

I suppose it depends on how conservative you want to be and whether you want to supply options to people like getchar_unlocked when it isn’t essential.

It could be made manually fork-safe if I could make a simple feature where “arc4random_uniform_unlocked(0);” with a 0 upperbound could trigger a reset of the static variables rand_bits and rand_holder which would be simple enough and could be added to a man page. I certainly read the man pages. (I’m surprised “man getchar” doesn’t “see also” to getchar_unlocked() by the way.)

If you look at profiling with programs that call it a lot, arc4random() inside arc4random_uniform() calls the expensive rekey function which makes it take more time. That’s why I can get around 2X-3X performance on a typical repetitive small upperbound loop and an extra 20% improvement on a single 65536 Knuth shuffle loop, even though my function repeats a binary search for ‘bit’ size every single time and has misses which demand calling up more data.

Otherwise my function would be particularly useful when there’s a loop with small upperbound like: 26+26+10 which, if I recall correctly, is in identd, which would call it frequently. I have a project that I generate MANY random record names much like that and I use fork()/fork()-exec() on many processes well before calling randomizing functions. Maybe I’m not the only one.

I don’t want to even try multithreading, especially as often as you guys say that it is too unpredictable. I believe you. Using lua scripts in multi-worker-process redis calls to avoid race conditions is awkward enough for me.

But otherwise it’s certainly your prerogative to not have it. At least you can’t legitimately say that it adds too much extra code or is too complex. It’s pretty simple to understand if you’re familiar with bitwise stuff.
On Mon, May 16, 2022 at 5:35 PM Stuart Henderson wrote: > On 2022/05/16 15:13, Luke Small wrote: > > If you’re not running a threaded program, my function wouldn’t be “less > > safe.” > > > > I’d imagine that 99% of programs aren’t multithreaded. > > code is reused in different places. non threaded programs are sometimes > turned into threaded programs and when that happens, sometimes > non-thread-safe calls are missed. so i'd argue that it is still less > safe. > > in some cases there might be benefits that would mean it's worth it, > especially if the failure modes would be obvious such that they can > be detected. really not seeing that here. (how often are you even > calling arc4random_uniform to consider it slow?!) > > if the consequence is not a crash but instead subtly broken randomness, > how long do you think it's going to take to notice and report/fix it? > even *very* broken randomness in widespread software distributions > has been known to sit around for a long time before it's noticed: > > - predictable rng in a popular os. *serious* bug. introduced 2006, > discovered/reported nearly 2 years later. > > - non-fork-safe rng in a popular ssl library, introduced sometime before > sept 2018, reported may 2019. > > -- -Luke
Re: Picky, but much more efficient arc4random_uniform!
On 2022/05/16 15:13, Luke Small wrote: > If you’re not running a threaded program, my function wouldn’t be “less > safe.” > > I’d imagine that 99% of programs aren’t multithreaded. code is reused in different places. non threaded programs are sometimes turned into threaded programs and when that happens, sometimes non-thread-safe calls are missed. so i'd argue that it is still less safe. in some cases there might be benefits that would mean it's worth it, especially if the failure modes would be obvious such that they can be detected. really not seeing that here. (how often are you even calling arc4random_uniform to consider it slow?!) if the consequence is not a crash but instead subtly broken randomness, how long do you think it's going to take to notice and report/fix it? even *very* broken randomness in widespread software distributions has been known to sit around for a long time before it's noticed: - predictable rng in a popular os. *serious* bug. introduced 2006, discovered/reported nearly 2 years later. - non-fork-safe rng in a popular ssl library, introduced sometime before sept 2018, reported may 2019.
Re: Picky, but much more efficient arc4random_uniform!
No...am I incorrect, especially on OpenBSD? Of course since you made such a remark, you seem like the kind of fellow that would put the nail in the coffin for spite. ...now I sound like an asshole. On Mon, May 16, 2022 at 4:00 PM Theo de Raadt wrote: > hey luke you know you sound like an asshole right? > > > Luke Small wrote: > > > If you’re not running a threaded program, my function wouldn’t be “less > > safe.” > > > > I’d imagine that 99% of programs aren’t multithreaded. > > > > On Mon, May 16, 2022 at 1:01 PM wrote: > > > > > > There is the specifically non-threadsafe call getchar_unlocked() on > > > OpenBSD > > > > which is presumably available for performance reasons alone, when > > > getchar() > > > > is a perfectly viable option and is even an ISO conforming function. > > > What I > > > > submitted could be such a higher performance non-threadsafe function. > > > > > > > > so, how about arc4random_uniform_unlocked() ?! > > > > > > getchar_unlocked is mandated by POSIX. > > > > > > OpenBSD has not yet invented an alternate function that only exists to > > > give away safety for performance. It has only gone in the opposite > > > direction if anything. > > > > > > -- > > -Luke >
Re: Picky, but much more efficient arc4random_uniform!
hey luke you know you sound like an asshole right? Luke Small wrote: > If you’re not running a threaded program, my function wouldn’t be “less > safe.” > > I’d imagine that 99% of programs aren’t multithreaded. > > On Mon, May 16, 2022 at 1:01 PM wrote: > > > > There is the specifically non-threadsafe call getchar_unlocked() on > > OpenBSD > > > which is presumably available for performance reasons alone, when > > getchar() > > > is a perfectly viable option and is even an ISO conforming function. > > What I > > > submitted could be such a higher performance non-threadsafe function. > > > > > > so, how about arc4random_uniform_unlocked() ?! > > > > getchar_unlocked is mandated by POSIX. > > > > OpenBSD has not yet invented an alternate function that only exists to > > give away safety for performance. It has only gone in the opposite > > direction if anything. > > > > -- > -Luke
Re: Picky, but much more efficient arc4random_uniform!
If you’re not running a threaded program, my function wouldn’t be “less safe.” I’d imagine that 99% of programs aren’t multithreaded. On Mon, May 16, 2022 at 1:01 PM wrote: > > There is the specifically non-threadsafe call getchar_unlocked() on > OpenBSD > > which is presumably available for performance reasons alone, when > getchar() > > is a perfectly viable option and is even an ISO conforming function. > What I > > submitted could be such a higher performance non-threadsafe function. > > > > so, how about arc4random_uniform_unlocked() ?! > > getchar_unlocked is mandated by POSIX. > > OpenBSD has not yet invented an alternate function that only exists to > give away safety for performance. It has only gone in the opposite > direction if anything. > > -- -Luke
Re: Picky, but much more efficient arc4random_uniform!
Philip Guenther wrote in :
|On Sun, 15 May 2022, Steffen Nurpmeso wrote:
|> Stuart Henderson wrote in
|...
|>|what's the perceived problem you're wanting to solve? and does that
|>|problem actually exist in the first place?
|>
|> The problem is that if you have a low upper bound then modulo will "remove a
|> lot of randomization". For example if you have a program which
|> generates Lotto numbers (1..49), then using _uniform() as it is will
|> generate many duplicates.
|
|Wut. The *WHOLE POINT* of arc4random_uniform() is that it has uniform
|distribution. Says right so in the manpage. If an implementation of that
|API fails to do that, it's a broken implementation.

I always admired its source code comments and have been left stunned
more than just once.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
Re: Picky, but much more efficient arc4random_uniform!
Luke Small on Sun, May 15 2022:
> The current implementation is nothing more than a naive arc4random() %
> upper_bound which trashes initial arc4random() calls it doesn’t like,

Yes, that's exactly what it is.

> The whole transformation by modulus of perfectly decent random data
> seems so awkward. [...] it seems like an ugly HACK

Let me point out that this way of doing it is absolutely standard. It is
not awkward or a hack.

> to simply meet demand

This is probably a good way of describing the design goal for this
highly sensitive piece of kernel internals.

	-p

PS: I am not an OpenBSD developer, a tiny contributor at most. I speak
as a bystander who happens to have some domain knowledge.
Re: Picky, but much more efficient arc4random_uniform!
Yeah. It most likely won't go in. From past experience and advice, not necessarily just from a perceived lack of merit. However, many, if not all of the arguments are based upon non-facts and misconceptions from earlier submissions or just not understanding what the software is doing. The only real thing that makes it not quite as good as the standard is mine isn't threadsafe. If it could be accepted as a higher performance, non-threadsafe call, it would perform better in many typical cases and perhaps even give a safer return value, especially in large upper_bound edge cases, I suspect. There is the specifically non-threadsafe call getchar_unlocked() on OpenBSD which is presumably available for performance reasons alone, when getchar() is a perfectly viable option and is even an ISO conforming function. What I submitted could be such a higher performance non-threadsafe function. so, how about arc4random_uniform_unlocked() ?! ...other than making upper_bound a uint32_t instead of the submitted uint64_t. That'd be somewhat of a problem. -Luke > You've had several developers tell you this is not going to go in. I'd > suggest > "read the room". > > If you want this for your own use, just keep it as a local diff. Nobody > will > know (or likely care). > > -ml >
Re: Picky, but much more efficient arc4random_uniform!
I’m not trying to be rude, but you don’t realize what’s going on here: uuu is a bitmask: ‘uuu’ (or (1 << bits)-1 ) in “ret = rand_holder & uuu;“ , only puts the lower ‘bit’ quantity of bits of rand_holder into ret, then it right shifts rand_holder afterward to trash them every time in the loop when it’s done. So if bits is 8, uuu is going to be 0xff No, you aren't: > > > for (;;) { > > if (rand_bits < bits) { > > rand_holder |= ((uint64_t)arc4random()) << > > rand_bits; > > > > /* > > * rand_bits will be a number between 0 and 31 > here > > * so the 0x20 bit will be empty > > * rand_bits += 32; > > */ > > rand_bits |= 32; > > } > > > > ret = rand_holder & uuu; > > rand_holder >>= bits; > > rand_bits -= bits; > > > > if (ret < upper_bound) > > return ret; > > } > > This isn't rejection sampling. This is reusing part of the rejected > sample. -- -Luke
Re: Picky, but much more efficient arc4random_uniform!
On Sun, May 15, 2022 at 08:40:19PM -0500, Luke Small wrote: > https://marc.info/?l=openbsd-tech=165259528425835=2 > > This one (which is strongly based upon my first of two versions) which I > submitted after Guenther correctly trashed version 2, doesn’t reuse any > part of the sample. It picks up a clean new bitfield upon failure. > > I think Guenther didn’t, perhaps like yourself, realize I submitted this > later program. That’s why he said it wasn’t correct. It didn’t occur to me > at the time of responding to him: “correct correct correct.” > You've had several developers tell you this is not going to go in. I'd suggest "read the room". If you want this for your own use, just keep it as a local diff. Nobody will know (or likely care). -ml > On Sun, May 15, 2022 at 7:47 PM Damien Miller wrote: > > > On Sat, 14 May 2022, Luke Small wrote: > > > > > Look at my code. I don’t even use a modulus operator. I perform hit and > > > miss with a random bitstream. > > > > > > How can I have a bias of something I don’t do? I return a bitstream which > > > meets the parameters of being a value less than the upper bound. Much > > like > > > arc4random_buf(). > > > > > > If I use arc4random_uniform() repeatedly to create a random distribution > > of > > > say numbers less than 0x1000 or even something weird like 0x1300 will the > > > random distribution be better with arc4random_uniform() or with mine? For > > > 0x1000 mine will simply pluck 12 bits of random data straight from the > > > arc4random() (and preserve the remaining 20 bits for later) on the first > > > try, just like it’s arc4random_buf(). > > > > > > arc4random_uniform() will perform a modulus of a 32 bit number which adds > > > data to the bitstream. Does it make it better? Perhaps it makes it harder > > > to guess the source bits. > > > > > > I don’t know; and I’m not going to pretend to be a cryptologist. But I’m > > > looking at modulo bias. 
> > > > > > I didn’t know what it was, before, but I basically “rejection sample”: > > > > > > > > https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/ > > > > No, you aren't: > > > > > for (;;) { > > > if (rand_bits < bits) { > > > rand_holder |= ((uint64_t)arc4random()) << > > > rand_bits; > > > > > > /* > > > * rand_bits will be a number between 0 and 31 > > here > > > * so the 0x20 bit will be empty > > > * rand_bits += 32; > > > */ > > > rand_bits |= 32; > > > } > > > > > > ret = rand_holder & uuu; > > > rand_holder >>= bits; > > > rand_bits -= bits; > > > > > > if (ret < upper_bound) > > > return ret; > > > } > > > > This isn't rejection sampling. This is reusing part of the rejected > > sample. > > > > Think of it like this: you want to uniformly generate a number in the > > range [2:10] by rolling 2x 6-sided dice. What do you do when you roll > > 11 or 12? You can't just reroll one of the dice because the other dice > > is constrained to be have rolled either 5 or 6, and so proceeding with > > it would force the output to be in the range [6:11] for these ~5.6% > > of initial rolls. Your output is no longer uniform. > > > > BTW the existing code already implements the prefered approach of the > > article you quoted. > > > > -d > > -- > -Luke
Re: Picky, but much more efficient arc4random_uniform!
https://marc.info/?l=openbsd-tech=165259528425835=2 This one (which is strongly based upon my first of two versions) which I submitted after Guenther correctly trashed version 2, doesn’t reuse any part of the sample. It picks up a clean new bitfield upon failure. I think Guenther didn’t, perhaps like yourself, realize I submitted this later program. That’s why he said it wasn’t correct. It didn’t occur to me at the time of responding to him: “correct correct correct.” On Sun, May 15, 2022 at 7:47 PM Damien Miller wrote: > On Sat, 14 May 2022, Luke Small wrote: > > > Look at my code. I don’t even use a modulus operator. I perform hit and > > miss with a random bitstream. > > > > How can I have a bias of something I don’t do? I return a bitstream which > > meets the parameters of being a value less than the upper bound. Much > like > > arc4random_buf(). > > > > If I use arc4random_uniform() repeatedly to create a random distribution > of > > say numbers less than 0x1000 or even something weird like 0x1300 will the > > random distribution be better with arc4random_uniform() or with mine? For > > 0x1000 mine will simply pluck 12 bits of random data straight from the > > arc4random() (and preserve the remaining 20 bits for later) on the first > > try, just like it’s arc4random_buf(). > > > > arc4random_uniform() will perform a modulus of a 32 bit number which adds > > data to the bitstream. Does it make it better? Perhaps it makes it harder > > to guess the source bits. > > > > I don’t know; and I’m not going to pretend to be a cryptologist. But I’m > > looking at modulo bias. 
> > > > I didn’t know what it was, before, but I basically “rejection sample”: > > > > > https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/ > > No, you aren't: > > > for (;;) { > > if (rand_bits < bits) { > > rand_holder |= ((uint64_t)arc4random()) << > > rand_bits; > > > > /* > > * rand_bits will be a number between 0 and 31 > here > > * so the 0x20 bit will be empty > > * rand_bits += 32; > > */ > > rand_bits |= 32; > > } > > > > ret = rand_holder & uuu; > > rand_holder >>= bits; > > rand_bits -= bits; > > > > if (ret < upper_bound) > > return ret; > > } > > This isn't rejection sampling. This is reusing part of the rejected > sample. > > Think of it like this: you want to uniformly generate a number in the > range [2:10] by rolling 2x 6-sided dice. What do you do when you roll > 11 or 12? You can't just reroll one of the dice because the other dice > is constrained to be have rolled either 5 or 6, and so proceeding with > it would force the output to be in the range [6:11] for these ~5.6% > of initial rolls. Your output is no longer uniform. > > BTW the existing code already implements the prefered approach of the > article you quoted. > > -d -- -Luke
Re: Picky, but much more efficient arc4random_uniform!
On Sat, 14 May 2022, Luke Small wrote: > Look at my code. I don’t even use a modulus operator. I perform hit and > miss with a random bitstream. > > How can I have a bias of something I don’t do? I return a bitstream which > meets the parameters of being a value less than the upper bound. Much like > arc4random_buf(). > > If I use arc4random_uniform() repeatedly to create a random distribution of > say numbers less than 0x1000 or even something weird like 0x1300 will the > random distribution be better with arc4random_uniform() or with mine? For > 0x1000 mine will simply pluck 12 bits of random data straight from the > arc4random() (and preserve the remaining 20 bits for later) on the first > try, just like it’s arc4random_buf(). > > arc4random_uniform() will perform a modulus of a 32 bit number which adds > data to the bitstream. Does it make it better? Perhaps it makes it harder > to guess the source bits. > > I don’t know; and I’m not going to pretend to be a cryptologist. But I’m > looking at modulo bias. > > I didn’t know what it was, before, but I basically “rejection sample”: > > https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/ No, you aren't: > for (;;) { > if (rand_bits < bits) { > rand_holder |= ((uint64_t)arc4random()) << > rand_bits; > > /* > * rand_bits will be a number between 0 and 31 here > * so the 0x20 bit will be empty > * rand_bits += 32; > */ > rand_bits |= 32; > } > > ret = rand_holder & uuu; > rand_holder >>= bits; > rand_bits -= bits; > > if (ret < upper_bound) > return ret; > } This isn't rejection sampling. This is reusing part of the rejected sample. Think of it like this: you want to uniformly generate a number in the range [2:10] by rolling 2x 6-sided dice. What do you do when you roll 11 or 12? 
You can't just reroll one of the dice because the other die is
constrained to have rolled either 5 or 6, and so proceeding with it
would force the output to be in the range [6:11] for these ~5.6% of
initial rolls. Your output is no longer uniform.

BTW the existing code already implements the preferred approach of the
article you quoted.

-d
Re: Picky, but much more efficient arc4random_uniform!
On Sun, 15 May 2022, Luke Small wrote:

> Do I really have to use specific terminology to make a point?
>
> I'm not educated enough on chacha20 to know whether, like I
> pointed out, choosing 5 bits from the middle of (or even from
> the tail end of one and the beginning of another) 32 bit pseudorandom
> cipher is "correct."

You don't need to understand chacha20 to understand arc4random_uniform().
I certainly didn't when I wrote it. The underlying CSPRNG is irrelevant
to how arc4random_uniform() works. It is treated as an oracle that
provides 32 random bits upon request. You could swap it out for 32
coin-tossing monkeys and the implementation wouldn't need to change.

It requests another 32 bit random value for each attempt at satisfying
the bounds check because they need to be independent - reusing parts of
a previous attempt is highly likely to introduce biases.

It's almost certainly possible to make this function faster, but it's
also very easy to get it wrong (e.g. I made one stupid math error in its
early implementation, forever immortalised by CVS). The existing code
has the advantage of being very obvious in how it works and therefore
has a very low risk of being wrong. If someone is proposing to move to
something less obvious then it's incumbent upon them to do the work to
prove that their alternative is just as correct.

> ...correct correct correct. Did I use that buzzword enough?

Highly experienced people are taking the time to give you detailed,
critical feedback. This can be hard to receive, but if you ever want to
improve then you should consider it and try to engage constructively.

-d
Re: Picky, but much more efficient arc4random_uniform!
Do I really have to use specific terminology to make a point? I'm not educated enough on chacha20 to know whether, like I pointed out, choosing 5 bits from the middle of (or even from the tail end of one and the beginning of another) 32-bit pseudorandom cipher output is "correct." ...correct correct correct. Did I use that buzzword enough? -Luke On Sun, May 15, 2022 at 5:26 PM Philip Guenther wrote: > On Sun, 15 May 2022, Luke Small wrote: > > The current implementation is nothing more than a naive arc4random() % > > upper_bound which trashes initial arc4random() calls it doesn’t like, > then > > transforms over a desired modulus. The whole transformation by modulus of > > perfectly decent random data seems so awkward. It’s not like it is used > as > > some majestic artistry of RSA it seems like an ugly HACK to simply meet a > > demand lacking of something better. > > You fail to mention correctness at all or address the fact that your > version isn't while the current one is. Meanwhile, you talk about getting > only just enough random data as if there's some sort of limited supply > when there isn't. > > "My version may be wrong, but at least it doesn't look naive!" > > That is utterly the wrong attitude for OpenBSD code. > > > Best wishes. > > Philip Guenther >
Re: Picky, but much more efficient arc4random_uniform!
On Sun, 15 May 2022, Luke Small wrote: > The current implementation is nothing more than a naive arc4random() % > upper_bound which trashes initial arc4random() calls it doesn’t like, then > transforms over a desired modulus. The whole transformation by modulus of > perfectly decent random data seems so awkward. It’s not like it is used as > some majestic artistry of RSA it seems like an ugly HACK to simply meet a > demand lacking of something better. You fail to mention correctness at all or address the fact that your version isn't while the current one is. Meanwhile, you talk about getting only just enough random data as if there's some sort of limited supply when there isn't. "My version may be wrong, but at least it doesn't look naive!" That is utterly the wrong attitude for OpenBSD code. Best wishes. Philip Guenther
Re: Picky, but much more efficient arc4random_uniform!
The current implementation is nothing more than a naive arc4random() % upper_bound which trashes initial arc4random() calls it doesn’t like, then transforms over a desired modulus. The whole transformation by modulus of perfectly decent random data seems so awkward. It’s not like it is some majestic artistry of RSA; it seems like an ugly HACK to simply meet a demand, for lack of something better. If you understand what I’ve done, it streams a bitfield into an integer type, like a buffer, holding just enough or slightly more data to meet the demands of the upper bound; if the value exceeds upperbound-1, it is trashed and a completely new bitfield is read in to check. It relies on arc4random() supplying good random data regardless of how many bits are in the bitfield. If it does so, it should and seems to supply a good distribution across the length of the bitfield, which may often be something like 5 for a common 26+26+10 upper_bound in /usr/src. It seems to me that it should be a pretty good, if not superior, method; at least in the realm of cleaner results. Perhaps what I’ve done is confusing with all the bitwise operators, but it isn’t some random hacky thing I’ve cobbled together. Or does arc4random() only provide decent random data 32 bits at a time, or even 8 bits at a time as arc4random_buf() would suggest? All I would have to prove is that chacha20 provides good or superior random bitfields regardless of how many bits are needed and regardless of whether they begin at the beginning of a byte. I don’t have the education for that, but “I got a ‘puter for Christmas.” lol. I can perhaps run simulations if I have nothing better to do. > I think I can say we know here uniformity is only *one* of the > desirable properties of a secure random generator. > > But I don't think anybody else execpt Luke was talking about > "improving". 
The sole purpose of arc4random_uniform() is to give a > good implementation of a random number function in a specific range > using arc4random() as the source. This is needed because the naive > implementation arc4random() % upper_bound is not uniform. > > -Otto > > > -- -Luke
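A power-of-two bitmask scheme along these lines is easy to analyze when every retry consumes a fresh, independent word: mask a word down to the smallest power of two covering the bound and reject anything still out of range. This is a hedged sketch of that general technique (toy generator, illustrative names), not the patch from this thread:

```c
#include <assert.h>
#include <stdint.h>

/* Toy deterministic stand-in for arc4random(). */
static uint32_t bm_state = 987654321u;
static uint32_t
bm_rand32(void)
{
	bm_state ^= bm_state << 13;
	bm_state ^= bm_state >> 17;
	bm_state ^= bm_state << 5;
	return bm_state;
}

/*
 * Mask a fresh 32-bit word down to the smallest power-of-two range
 * covering upper_bound and reject out-of-range values.  Because the
 * mask keeps at least half of the masked values in range, the
 * expected number of draws per call is below 2.
 */
static uint32_t
bitmask_uniform(uint32_t upper_bound)
{
	uint32_t bits, mask, r;

	if (upper_bound < 2)
		return 0;
	bits = 32 - __builtin_clz(upper_bound - 1);	/* GCC/Clang builtin */
	mask = (bits == 32) ? 0xffffffffu : (1u << bits) - 1;
	do {
		r = bm_rand32() & mask;	/* fresh word every attempt */
	} while (r >= upper_bound);
	return r;
}
```

Drawing a fresh word per retry is what keeps each attempt independent; the dispute in this thread is about what happens when leftover bits from a rejected attempt are reused instead.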
Re: Picky, but much more efficient arc4random_uniform!
On Sun, May 15, 2022 at 08:57:09AM -0300, Crystal Kolipe wrote: > On Sun, May 15, 2022 at 11:44:29AM +0200, Otto Moerbeek wrote: > > On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote: > > > How did someone prove the current implementation was cryptographically > > > sound? Did they run massive simulations which ran the gamut of the > > > uint32_t > > > range which demanded tight tolerances over varying length runs? > > > > > > How was rc4 cipher proven bad for pseudorandom numbers? I'd be willing > > > to > > > bet it wasn't from a mathematical proof; it was from bad data. > > > You miss the point completely. The point is: how do you derive a > > uniformly distributed random function for a smaller range, given a > > uniformly distibuted random function over the range [0-2^32-1]. > > > > The current implementation does exactly that and has all the > > information in the comments to verify the uniformity claim. You only > > need to use basic mathematical properties of modulo arithmetic to > > do the verification. > > You do all realise that uniform distribution alone does not make a > good random number generator, don't you? > > For example: > > 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5... > > That's uniformly distributed, and also useless as a random number > stream. > > Further more, _cryptographically secure_ random number generation is > not the same as _mathematically good_ random number generation. > > There are plenty of random number generation formulas which are > considered good and useful from a mathematical basis, but which are > useless for cryptography. > > So random, (pun intended), hacks at the arc4random code are not > likely to 'improve' it from the general standpoint, (although if you > have a specific need for a specific private application, that's > different). I think Stuart has already more or less made that point. 
> I think I can say we know here uniformity is only *one* of the desirable properties of a secure random generator. But I don't think anybody else except Luke was talking about "improving". The sole purpose of arc4random_uniform() is to give a good implementation of a random number function in a specific range using arc4random() as the source. This is needed because the naive implementation arc4random() % upper_bound is not uniform. -Otto
Re: Picky, but much more efficient arc4random_uniform!
On Sun, May 15, 2022 at 11:44:29AM +0200, Otto Moerbeek wrote: > On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote: > > How did someone prove the current implementation was cryptographically > > sound? Did they run massive simulations which ran the gamut of the uint32_t > > range which demanded tight tolerances over varying length runs? > > > > How was rc4 cipher proven bad for pseudorandom numbers? I'd be willing to > > bet it wasn't from a mathematical proof; it was from bad data. > You miss the point completely. The point is: how do you derive a > uniformly distributed random function for a smaller range, given a > uniformly distibuted random function over the range [0-2^32-1]. > > The current implementation does exactly that and has all the > information in the comments to verify the uniformity claim. You only > need to use basic mathematical properties of modulo arithmetic to > do the verification. You do all realise that uniform distribution alone does not make a good random number generator, don't you? For example: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5... That's uniformly distributed, and also useless as a random number stream. Furthermore, _cryptographically secure_ random number generation is not the same as _mathematically good_ random number generation. There are plenty of random number generation formulas which are considered good and useful from a mathematical basis, but which are useless for cryptography. So random, (pun intended), hacks at the arc4random code are not likely to 'improve' it from the general standpoint, (although if you have a specific need for a specific private application, that's different). I think Stuart has already more or less made that point.
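The 1, 2, 3, 4, 5, ... point is easy to make concrete: a bare counter has a perfectly flat histogram yet every output is completely predictable from the previous one. A deliberately silly sketch, for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* "Uniform" over [0,5): every value appears exactly once per 5 calls,
 * yet the sequence is trivially predictable. */
static uint32_t counter;
static uint32_t
counter_rand5(void)
{
	return counter++ % 5;
}

/* Frequency check: after 5*n calls from a reset counter, each of the
 * five buckets holds exactly n -- a flawless uniformity result. */
static int
frequencies_flat(uint32_t n)
{
	uint32_t buckets[5] = { 0 };
	uint32_t i;

	counter = 0;
	for (i = 0; i < 5 * n; i++)
		buckets[counter_rand5()]++;
	for (i = 0; i < 5; i++)
		if (buckets[i] != n)
			return 0;
	return 1;
}
```

The generator passes any frequency test you throw at it, which is exactly why uniformity alone says nothing about cryptographic quality.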
Re: Picky, but much more efficient arc4random_uniform!
Random generators have been analyzed for years. Pick up "The Art of Computer Programming, Vol. 2: Seminumerical Algorithms" by Donald E. Knuth, read it, then come back with actual science.
Re: Picky, but much more efficient arc4random_uniform!
On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote: > I have a feeling that making it threadsafe under any quantity of threads > and still perform is likely not feasible, but if I could; or if making a > nonthreadsafe variant was possible: > > Creating a mathematical proof that somehow is “simple and probably correct” > enough, would be an impossible task even for a PhD mathematician. > > How did someone prove the current implementation was cryptographically > sound? Did they run massive simulations which ran the gamut of the uint32_t > range which demanded tight tolerances over varying length runs? > > How was rc4 cipher proven bad for pseudorandom numbers? I’d be willing to > bet it wasn’t from a mathematical proof; it was from bad data. > > I’m guessing that large upperbounds approaching 2**32 don’t feed very > soundly in the current implementation using a modulus; although I suspect > that there isn’t much of a call for such things, I’m pretty sure I saw a > 3,000,000,000 upperbound in the src partition tonight. You miss the point completely. The point is: how do you derive a uniformly distributed random function for a smaller range, given a uniformly distributed random function over the range [0-2^32-1]. The current implementation does exactly that and has all the information in the comments to verify the uniformity claim. You only need to use basic mathematical properties of modulo arithmetic to do the verification. -Otto > > On Sun, May 15, 2022 at 3:15 AM Otto Moerbeek wrote: > > > On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote: > > > > > This is version 1, which I was pretty sure was secure. > > > > > > I revamped it with a few features and implanted the binary search for > > 'bit' > > > > > > in most cases, which aren't intentionally worst-case, it's pretty darned > > > fast! 
> > > > > > This is a sample run of my program with your piece of code included: > > > > > > > > > 1 99319 100239 > > > 2 100353 99584 > > > 3 100032 99879 > > > 4 99704 100229 > > > 5 99530 99796 > > > 6 99617 100355 > > > 7 100490 100319 > > > 8 99793 100114 > > > 9 100426 99744 > > > 10 100315 100558 > > > 11 99279 99766 > > > 12 99572 99750 > > > 13 99955 100017 > > > 14 100413 15 > > > 15 100190 100052 > > > 16 101071 100195 > > > 17 100322 100224 > > > 18 99637 99540 > > > 19 100323 99251 > > > 20 99841 100177 > > > 21 99948 99504 > > > 22 100065 100031 > > > 23 100026 99827 > > > 24 99836 99818 > > > 25 100245 99822 > > > 26 100088 99678 > > > 27 99957 3 > > > 28 100065 99961 > > > 29 100701 100679 > > > 30 99756 99587 > > > 31 100220 100076 > > > 32 100067 15 > > > 33 99547 99984 > > > 34 100124 100031 > > > 35 99547 100661 > > > 36 99801 99963 > > > 37 100189 100230 > > > 38 99878 99579 > > > 39 99864 99442 > > > 40 99683 14 > > > 41 99907 100094 > > > 42 100291 99817 > > > 43 99908 99984 > > > 44 100044 100606 > > > 45 100065 100120 > > > 46 99358 100141 > > > 47 100152 100442 > > > 48 10 100279 > > > 49 100486 99848 > > > > Sadly a sample run cannot be used to proof your implementation is > > correct. It can only be used to show it is not correct, like Philip > > did. To show your implementation produces uniform results in all > > cases, you need to provide a solid argumentation that is easy to > > follow. So far you failed to do that and I do not see it coming, given > > the complexituy of your implementation. The current implementation > > has a straightforward argumentation as it uses relatively simple > > mathematical properties of modulo arithmetic. > > > > It is also clear your code (as it uses statics) is not thread safe. > > > > So to answer you original question clearly: no, we will not accept this. > > > > -Otto > > > -- > -Luke
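That modulo-arithmetic argument can even be checked exhaustively if the 32-bit oracle is shrunk to 8 bits: enumerate all 256 equally likely words, skip the 256 % upper_bound lowest ones, reduce the rest, and count. A sketch of that check (the 8-bit word size is my simplification for illustration; the libc code does the same thing over 2**32):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Exhaustively verify the rejection-plus-modulo construction for an
 * 8-bit oracle: skip the 256 % ub lowest raw values, reduce the rest
 * mod ub, and confirm every output appears exactly the same number of
 * times across all 256 equally likely oracle words.
 */
static int
modulo_rejection_is_uniform(uint32_t ub)	/* 2 <= ub <= 256 */
{
	uint32_t counts[256] = { 0 };
	uint32_t min = 256 % ub;	/* raw values below this are rejected */
	uint32_t expect = (256 - min) / ub;
	uint32_t r;

	for (r = min; r < 256; r++)
		counts[r % ub]++;
	for (r = 0; r < ub; r++)
		if (counts[r] != expect)
			return 0;
	return 1;
}
```

The accepted range has length 256 - (256 % ub), which is divisible by ub, so a contiguous run of residues lands on each output class equally often; the 32-bit case follows by the same argument with 2**32 in place of 256.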
Re: Picky, but much more efficient arc4random_uniform!
I have a feeling that making it threadsafe under any quantity of threads and still perform is likely not feasible, but if I could; or if making a nonthreadsafe variant was possible: Creating a mathematical proof that somehow is “simple and probably correct” enough, would be an impossible task even for a PhD mathematician. How did someone prove the current implementation was cryptographically sound? Did they run massive simulations which ran the gamut of the uint32_t range which demanded tight tolerances over varying length runs? How was rc4 cipher proven bad for pseudorandom numbers? I’d be willing to bet it wasn’t from a mathematical proof; it was from bad data. I’m guessing that large upperbounds approaching 2**32 don’t feed very soundly in the current implementation using a modulus; although I suspect that there isn’t much of a call for such things, I’m pretty sure I saw a 3,000,000,000 upperbound in the src partition tonight. On Sun, May 15, 2022 at 3:15 AM Otto Moerbeek wrote: > On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote: > > > This is version 1, which I was pretty sure was secure. > > > > I revamped it with a few features and implanted the binary search for > 'bit' > > > > in most cases, which aren't intentionally worst-case, it's pretty darned > > fast! 
> > > > This is a sample run of my program with your piece of code included: > > > > > > 1 99319 100239 > > 2 100353 99584 > > 3 100032 99879 > > 4 99704 100229 > > 5 99530 99796 > > 6 99617 100355 > > 7 100490 100319 > > 8 99793 100114 > > 9 100426 99744 > > 10 100315 100558 > > 11 99279 99766 > > 12 99572 99750 > > 13 99955 100017 > > 14 100413 15 > > 15 100190 100052 > > 16 101071 100195 > > 17 100322 100224 > > 18 99637 99540 > > 19 100323 99251 > > 20 99841 100177 > > 21 99948 99504 > > 22 100065 100031 > > 23 100026 99827 > > 24 99836 99818 > > 25 100245 99822 > > 26 100088 99678 > > 27 99957 3 > > 28 100065 99961 > > 29 100701 100679 > > 30 99756 99587 > > 31 100220 100076 > > 32 100067 15 > > 33 99547 99984 > > 34 100124 100031 > > 35 99547 100661 > > 36 99801 99963 > > 37 100189 100230 > > 38 99878 99579 > > 39 99864 99442 > > 40 99683 14 > > 41 99907 100094 > > 42 100291 99817 > > 43 99908 99984 > > 44 100044 100606 > > 45 100065 100120 > > 46 99358 100141 > > 47 100152 100442 > > 48 10 100279 > > 49 100486 99848 > > Sadly a sample run cannot be used to proof your implementation is > correct. It can only be used to show it is not correct, like Philip > did. To show your implementation produces uniform results in all > cases, you need to provide a solid argumentation that is easy to > follow. So far you failed to do that and I do not see it coming, given > the complexituy of your implementation. The current implementation > has a straightforward argumentation as it uses relatively simple > mathematical properties of modulo arithmetic. > > It is also clear your code (as it uses statics) is not thread safe. > > So to answer you original question clearly: no, we will not accept this. > > -Otto > -- -Luke
Re: Picky, but much more efficient arc4random_uniform!
On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote: > This is version 1, which I was pretty sure was secure. > > I revamped it with a few features and implanted the binary search for 'bit' > > in most cases, which aren't intentionally worst-case, it's pretty darned > fast! > > This is a sample run of my program with your piece of code included: > > > 1 99319 100239 > 2 100353 99584 > 3 100032 99879 > 4 99704 100229 > 5 99530 99796 > 6 99617 100355 > 7 100490 100319 > 8 99793 100114 > 9 100426 99744 > 10 100315 100558 > 11 99279 99766 > 12 99572 99750 > 13 99955 100017 > 14 100413 15 > 15 100190 100052 > 16 101071 100195 > 17 100322 100224 > 18 99637 99540 > 19 100323 99251 > 20 99841 100177 > 21 99948 99504 > 22 100065 100031 > 23 100026 99827 > 24 99836 99818 > 25 100245 99822 > 26 100088 99678 > 27 99957 3 > 28 100065 99961 > 29 100701 100679 > 30 99756 99587 > 31 100220 100076 > 32 100067 15 > 33 99547 99984 > 34 100124 100031 > 35 99547 100661 > 36 99801 99963 > 37 100189 100230 > 38 99878 99579 > 39 99864 99442 > 40 99683 14 > 41 99907 100094 > 42 100291 99817 > 43 99908 99984 > 44 100044 100606 > 45 100065 100120 > 46 99358 100141 > 47 100152 100442 > 48 10 100279 > 49 100486 99848 Sadly a sample run cannot be used to prove your implementation is correct. It can only be used to show it is not correct, like Philip did. To show your implementation produces uniform results in all cases, you need to provide a solid argumentation that is easy to follow. So far you failed to do that and I do not see it coming, given the complexity of your implementation. The current implementation has a straightforward argumentation as it uses relatively simple mathematical properties of modulo arithmetic. It is also clear your code (as it uses statics) is not thread safe. So to answer your original question clearly: no, we will not accept this. -Otto
Re: Picky, but much more efficient arc4random_uniform!
This is version 1, which I was pretty sure was secure.

I revamped it with a few features and implemented the binary search for 'bit'; in most cases, which aren't intentionally worst-case, it's pretty darned fast!

This is a sample run of my program with your piece of code included:

 1	 99319	100239
 2	100353	 99584
 3	100032	 99879
 4	 99704	100229
 5	 99530	 99796
 6	 99617	100355
 7	100490	100319
 8	 99793	100114
 9	100426	 99744
10	100315	100558
11	 99279	 99766
12	 99572	 99750
13	 99955	100017
14	100413	15
15	100190	100052
16	101071	100195
17	100322	100224
18	 99637	 99540
19	100323	 99251
20	 99841	100177
21	 99948	 99504
22	100065	100031
23	100026	 99827
24	 99836	 99818
25	100245	 99822
26	100088	 99678
27	 99957	3
28	100065	 99961
29	100701	100679
30	 99756	 99587
31	100220	100076
32	100067	15
33	 99547	 99984
34	100124	100031
35	 99547	100661
36	 99801	 99963
37	100189	100230
38	 99878	 99579
39	 99864	 99442
40	 99683	14
41	 99907	100094
42	100291	 99817
43	 99908	 99984
44	100044	100606
45	100065	100120
46	 99358	100141
47	100152	100442
48	10	100279
49	100486	 99848

newdata time = 0.240530757 seconds
newdata_fast time = 0.073868626 seconds
The original takes 325.620% of the runtime of newdata_fast
newdataTypableFilename time = 0.236748989 seconds
newdataTypableFilename_fast time = 0.115842333 seconds
The original takes 204.372% of the runtime of newdataTypableFilename_fast
rand_ideal time = 0.235832820 seconds
rand_ideal_fast time = 0.025300798 seconds
The original takes 932.116% of the runtime of rand_ideal_fast
rand_short_ideal time = 0.236828684 seconds
rand_short_ideal_fast time = 0.124025922 seconds
The original takes 190.951% of the runtime of rand_short_ideal_fast
rand_short_worst time = 0.237142869 seconds
rand_short_worst_fast time = 0.294156486 seconds
The original takes 80.618% of the runtime of rand_short_worst_fast
rand_worst time = 0.237002775 seconds
rand_worst_fast time = 0.377148420 seconds
The original takes 62.841% of the runtime of rand_worst_fast
shuffle time = 0.003044472 seconds
shuffle_fast time = 0.002590664 seconds The original takes 117.517% of the runtime of shuffle_fast If it crashed here, you are trying to profile. Turn off pledge at the beginning of main(). On Sat, May 14, 2022 at 7:03 PM Philip Guenther wrote: > On Sun, 15 May 2022, Steffen Nurpmeso wrote: > > Stuart Henderson wrote in > ... > > |what's the perceived problem you're wanting to solve? and does that > > |problem actually exist in the first place? > > > > The problem is that if have a low upper bound then modulo will "remove a > > lot of randomization". For example if you have a program which > > generates Lotto numbers (1..49), then using _uniform() as it is will > > generate many duplicates. > > Wut. The *WHOLE POINT* of arc4random_uniform() is that it has uniform > distribution. Says right so in the manpage. If an implementation of that > API fails to do that, it's a broken implementation. > > The OpenBSD implementation achieves that. NetBSD's implementation has the > same core logic. Here's my quick lotto pick demo program, doing 490 > picks, so they should all be somewhere near 10: > > #include > #include > #define LIMIT 49 > int > main(void) > { > unsigned counts[LIMIT] = { 0 }; > int i; > for (i = 0; i < 10 * LIMIT; i++) > counts[arc4random_uniform(LIMIT)]++; > for (i = 0; i < LIMIT; i++) > printf("%2d\t%7u\n", i+1, counts[i]); > return 0; > } > > And a sample run: > > : bleys; ./a.out > > 1 100100 > 2 100639 > 399965 > 499729 > 599641 > 699650 > 7 100299 > 8 100164 > 999791 > 10 100101 > 11 100657 > 12 100042 > 1399661 > 1499927 > 1599426 > 1699491 > 1799646 > 18 100133 > 19 100013 > 2099942 > 2199873 > 2299924 > 2399567 > 24 100152 > 25 100688 > 26 100011 > 27 100481 > 2899980 > 29 100406 > 3099726 > 3199808 > 3299929 > 33 100050 > 3499983 > 35 100048 > 3699771 > 3799906 > 38 100215 > 39 100261 > 40 100426 > 4199847 > 4299533 > 43 100368 > 4499695 > 45 100041 > 46 100465 > 4799875 > 48 100034 > 4999920 > : bleys; > > Looks pretty good to me, with repeated 
runs showing the values bouncing > around. > > > ... > > I was a bit interested when i saw Luke's message, but i am no > > mathematician and, like i said, i in fact never needed the > > functionality. And the question i would have is how small > > upper_bound has to be for the modulo problem to really kick in. > > If the implementation isn't broken, it's not a problem, period. > > > > And even if, whether it
Re: Picky, but much more efficient arc4random_uniform!
On Sat, May 14, 2022 at 05:03:17PM -0700, Philip Guenther wrote: > On Sun, 15 May 2022, Steffen Nurpmeso wrote: > > Stuart Henderson wrote in > ... > > |what's the perceived problem you're wanting to solve? and does that > > |problem actually exist in the first place? > > > > The problem is that if have a low upper bound then modulo will "remove a > > lot of randomization". For example if you have a program which > > generates Lotto numbers (1..49), then using _uniform() as it is will > > generate many duplicates. > > Wut. The *WHOLE POINT* of arc4random_uniform() is that it has uniform > distribution. Says right so in the manpage. If an implementation of that > API fails to do that, it's a broken implementation. > > The OpenBSD implementation achieves that. NetBSD's implementation has the > same core logic. Here's my quick lotto pick demo program, doing 490 > picks, so they should all be somewhere near 10: > > #include > #include > #define LIMIT 49 > int > main(void) > { > unsigned counts[LIMIT] = { 0 }; > int i; > for (i = 0; i < 10 * LIMIT; i++) > counts[arc4random_uniform(LIMIT)]++; > for (i = 0; i < LIMIT; i++) > printf("%2d\t%7u\n", i+1, counts[i]); > return 0; > } > > And a sample run: > > : bleys; ./a.out > > 1 100100 > 2 100639 > 399965 > 499729 > 599641 > 699650 > 7 100299 > 8 100164 > 999791 > 10 100101 > 11 100657 > 12 100042 > 1399661 > 1499927 > 1599426 > 1699491 > 1799646 > 18 100133 > 19 100013 > 2099942 > 2199873 > 2299924 > 2399567 > 24 100152 > 25 100688 > 26 100011 > 27 100481 > 2899980 > 29 100406 > 3099726 > 3199808 > 3299929 > 33 100050 > 3499983 > 35 100048 > 3699771 > 3799906 > 38 100215 > 39 100261 > 40 100426 > 4199847 > 4299533 > 43 100368 > 4499695 > 45 100041 > 46 100465 > 4799875 > 48 100034 > 4999920 > : bleys; > > Looks pretty good to me, with repeated runs showing the values bouncing > around. > > > ... 
> > I was a bit interested when i saw Luke's message, but i am no > > mathematician and, like i said, i in fact never needed the > > functionality. And the question i would have is how small > > upper_bound has to be for the modulo problem to really kick in. > > If the implementation isn't broken, it's not a problem, period. > > > > And even if, whether it matters. For example, if you have > > a bitset of 1024 "in-flight things", and the small upper bound > > hits a used slot, you could simply search forward until you find > > a free one. I mean, with 1024 possible values the number of > > possibilities is anyhow restricted. > > Well hey, let's give Luke's a try and see how uniform it is: > > > #include > #include > #define LIMIT 49 > int > main(void) > { > unsigned counts[LIMIT] = { 0 }; > unsigned counts2[LIMIT] = { 0 }; > int i; > for (i = 0; i < 10 * LIMIT; i++) { > counts[arc4random_uniform(LIMIT)]++; > counts2[arc4random_uniform_fast_simple(LIMIT)]++; > } > for (i = 0; i < LIMIT; i++) > printf("%2d\t%7u\t%7u\n", i+1, counts[i], counts2[i]); > return 0; > } > > : bleys; ./a.out > 1 100188 76502 > 299983 76602 > 3 100521 76522 > 4 100416 76682 > 5 100171 76387 > 6 100163 76759 > 7 100024 76336 > 8 19 76703 > 999769 76237 > 1099892 76532 > 11 100197 76730 > 12 100483 76398 > 1399769 76310 > 14 100075 76474 > 1599781 76599 > 1699846 76439 > 1799814 76430 > 18 100313 76648 > 19 100259 76813 > 2099885 77068 > 21 100302 76546 > 228 76698 > 2399491 76678 > 24 100340 76324 > 2599763 115263 > 2699872 153008 > 27 100022 152979 > 2899481 153793 > 29 100018 210714 > 3099617 229286 > 31 100167 297003 > 32 100270 449664 > 33 100468 76790 > 3499115 76452 > 3599921 76392 > 3699862 76140 > 37 100485 76607 > 38 100029 75885 > 3999577 76498 > 4099479 76727 > 41 100139 76746 > 42 100883 76698 > 43 100102 76474 > 4499801 76592 > 45 100117 76124 > 4699678 76417 > 4799770 76639 > 4899524 77034 > 49 100151 76658 > : bleys; > > Wow, that last column is *bad*.
Re: Picky, but much more efficient arc4random_uniform!
On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> Stuart Henderson wrote in ...
> |what's the perceived problem you're wanting to solve? and does that
> |problem actually exist in the first place?
>
> The problem is that if have a low upper bound then modulo will "remove a
> lot of randomization". For example if you have a program which
> generates Lotto numbers (1..49), then using _uniform() as it is will
> generate many duplicates.

Wut. The *WHOLE POINT* of arc4random_uniform() is that it has uniform distribution. Says right so in the manpage. If an implementation of that API fails to do that, it's a broken implementation.

The OpenBSD implementation achieves that. NetBSD's implementation has the same core logic. Here's my quick lotto pick demo program, doing 4,900,000 picks, so they should all be somewhere near 100,000:

	#include <stdio.h>
	#include <stdlib.h>
	#define LIMIT 49
	int
	main(void)
	{
		unsigned counts[LIMIT] = { 0 };
		int i;
		for (i = 0; i < 100000 * LIMIT; i++)
			counts[arc4random_uniform(LIMIT)]++;
		for (i = 0; i < LIMIT; i++)
			printf("%2d\t%7u\n", i+1, counts[i]);
		return 0;
	}

And a sample run:

	: bleys; ./a.out
	 1	100100
	 2	100639
	 3	 99965
	 4	 99729
	 5	 99641
	 6	 99650
	 7	100299
	 8	100164
	 9	 99791
	10	100101
	11	100657
	12	100042
	13	 99661
	14	 99927
	15	 99426
	16	 99491
	17	 99646
	18	100133
	19	100013
	20	 99942
	21	 99873
	22	 99924
	23	 99567
	24	100152
	25	100688
	26	100011
	27	100481
	28	 99980
	29	100406
	30	 99726
	31	 99808
	32	 99929
	33	100050
	34	 99983
	35	100048
	36	 99771
	37	 99906
	38	100215
	39	100261
	40	100426
	41	 99847
	42	 99533
	43	100368
	44	 99695
	45	100041
	46	100465
	47	 99875
	48	100034
	49	 99920
	: bleys;

Looks pretty good to me, with repeated runs showing the values bouncing around.

> ...
> I was a bit interested when i saw Luke's message, but i am no
> mathematician and, like i said, i in fact never needed the
> functionality. And the question i would have is how small
> upper_bound has to be for the modulo problem to really kick in.

If the implementation isn't broken, it's not a problem, period.

> And even if, whether it matters. For example, if you have
> a bitset of 1024 "in-flight things", and the small upper bound
> hits a used slot, you could simply search forward until you find
> a free one. I mean, with 1024 possible values the number of
> possibilities is anyhow restricted.

Well hey, let's give Luke's a try and see how uniform it is:

	#include <stdio.h>
	#include <stdlib.h>
	#define LIMIT 49
	int
	main(void)
	{
		unsigned counts[LIMIT] = { 0 };
		unsigned counts2[LIMIT] = { 0 };
		int i;
		for (i = 0; i < 100000 * LIMIT; i++) {
			counts[arc4random_uniform(LIMIT)]++;
			counts2[arc4random_uniform_fast_simple(LIMIT)]++;
		}
		for (i = 0; i < LIMIT; i++)
			printf("%2d\t%7u\t%7u\n", i+1, counts[i], counts2[i]);
		return 0;
	}

	: bleys; ./a.out
	 1	100188	 76502
	 2	 99983	 76602
	 3	100521	 76522
	 4	100416	 76682
	 5	100171	 76387
	 6	100163	 76759
	 7	100024	 76336
	 8	19	 76703
	 9	 99769	 76237
	10	 99892	 76532
	11	100197	 76730
	12	100483	 76398
	13	 99769	 76310
	14	100075	 76474
	15	 99781	 76599
	16	 99846	 76439
	17	 99814	 76430
	18	100313	 76648
	19	100259	 76813
	20	 99885	 77068
	21	100302	 76546
	22	8	 76698
	23	 99491	 76678
	24	100340	 76324
	25	 99763	115263
	26	 99872	153008
	27	100022	152979
	28	 99481	153793
	29	100018	210714
	30	 99617	229286
	31	100167	297003
	32	100270	449664
	33	100468	 76790
	34	 99115	 76452
	35	 99921	 76392
	36	 99862	 76140
	37	100485	 76607
	38	100029	 75885
	39	 99577	 76498
	40	 99479	 76727
	41	100139	 76746
	42	100883	 76698
	43	100102	 76474
	44	 99801	 76592
	45	100117	 76124
	46	 99678	 76417
	47	 99770	 76639
	48	 99524	 77034
	49	100151	 76658
	: bleys;

Wow, that last column is *bad*. Repeated runs show 25-32 are consistently high, 32 being *outrageously* off, and the others are low.

Luke's implementation does not correctly implement the API. Doesn't matter if it's a million times faster when it doesn't deliver the goal.

Philip Guenther
Re: Picky, but much more efficient arc4random_uniform!
Stuart Henderson wrote in : |On 2022/05/14 06:56, Luke Small wrote: |> If I use arc4random_uniform() repeatedly to create a random distribution \ |> of |> say numbers less than 0x1000 or even something weird like 0x1300 will the |> random distribution be better with arc4random_uniform() or with mine? | |there's no point to have a choice of different arc4random_uniform_xyz |with different characteristics, how is somebody going to pick the |"right" one? | |the bottom of netbsd's version of the arc4random(3) manual says it well: | | One may be tempted to create new APIs to accommodate different \ | security | models and performance constraints without unpleasant surprises \ | on dif- | ferent operating systems. This should not be done lightly, though, | because there are already too many different choices, and too \ | many oppor- | tunities for programmers to reach for one and pick the wrong one. | |what's the perceived problem you're wanting to solve? and does that |problem actually exist in the first place? The problem is that if you have a low upper bound then modulo will "remove a lot of randomization". For example if you have a program which generates Lotto numbers (1..49), then using _uniform() as it is will generate many duplicates. 
I used the approach i found in a TeX file over twenty years ago ("RANDOM.TEX, v.1 (Donald Arseneau) from TeTeX, texmf/tex/generic/misc/random.tex"; and now that i look, he published a .v2 not too long ago), it was named _clip() as it produces something in between a minimum and a maximum:

	_max -= _min;
	++_max;
	_min = Maximum::sir - 2; // m - 2 = 2^(32|64) - 3
	if (_max != 0)
		_min /= _max;

	for (;;) {
		uir i;

		i = _random;
		i += M1::uir;
		if (_min != 0)
			i /= _min;
		if (i >= _max && (i != 0 || _max != 0)) {
			_random *= 16807; // just MUL
			if (_random == 0)
				_random = 1;
			continue;
		}
		_random = i;
		break;
	}

	_random += origmin;
	_Assert(1 && _random >= origmin && _random <= origmax);

That is, the random number is multiplied with a, eh, non-prime, as long as we do not fit. However i found this became a terrible mess with 64-bit numbers, so i removed the function, aka did not "port it over" when i implemented a random number generator for my own use cases. Maybe i should have increased the multiplicator. It is really, really bad. :-) I only ever used it for a Lotto number program anyhow, i think, and so i thought it would be better to simply read a one byte random and skip those that do not fit, instead of making them always fit with modulo 50. That is, application specific workaround. Things are less harmful for large numbers, anyway. In fact i have to say i was astonished to see someone using this function, not too long ago i saw a FreeBSD commit fly by which uses it!

I was a bit interested when i saw Luke's message, but i am no mathematician and, like i said, i in fact never needed the functionality. And the question i would have is how small upper_bound has to be for the modulo problem to really kick in.

And even if, whether it matters. For example, if you have a bitset of 1024 "in-flight things", and the small upper bound hits a used slot, you could simply search forward until you find a free one. I mean, with 1024 possible values the number of possibilities is anyhow restricted. 
--steffen | |Der Kragenbaer,The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt)
Re: Picky, but much more efficient arc4random_uniform!
int main()
{
	int results[3] = { 0, 0, 0 };

	for (int i = 0; i < 100000; i++)
		results[arc4random_uniform_fast_simple(3)]++;
	for (int i = 0; i < 3; i++)
		printf("%d: %d\n", i, results[i]);
	return 0;
}

% ./a.out
0: 24809
1: 50011
2: 25180

You can't reuse bits because they'll be biased.
Re: Picky, but much more efficient arc4random_uniform!
On 2022/05/14 06:56, Luke Small wrote: > If I use arc4random_uniform() repeatedly to create a random distribution of > say numbers less than 0x1000 or even something weird like 0x1300 will the > random distribution be better with arc4random_uniform() or with mine? there's no point to have a choice of different arc4random_uniform_xyz with different characteristics, how is somebody going to pick the "right" one? the bottom of netbsd's version of the arc4random(3) manual says it well: One may be tempted to create new APIs to accommodate different security models and performance constraints without unpleasant surprises on different operating systems. This should not be done lightly, though, because there are already too many different choices, and too many opportunities for programmers to reach for one and pick the wrong one. what's the perceived problem you're wanting to solve? and does that problem actually exist in the first place?
Re: Picky, but much more efficient arc4random_uniform!
Look at my code. I don’t even use a modulus operator. I perform hit and miss with a random bitstream. How can I have a bias of something I don’t do? I return a bitstream which meets the parameters of being a value less than the upper bound. Much like arc4random_buf(). If I use arc4random_uniform() repeatedly to create a random distribution of say numbers less than 0x1000 or even something weird like 0x1300 will the random distribution be better with arc4random_uniform() or with mine? For 0x1000 mine will simply pluck 12 bits of random data straight from the arc4random() (and preserve the remaining 20 bits for later) on the first try, just like arc4random_buf(). arc4random_uniform() will perform a modulus of a 32 bit number which adds data to the bitstream. Does it make it better? Perhaps it makes it harder to guess the source bits. I don’t know; and I’m not going to pretend to be a cryptologist. But I’m looking at modulo bias. I didn’t know what it was before, but I basically “rejection sample”: https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/ On Sat, May 14, 2022 at 6:14 AM Otto Moerbeek wrote: > On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote: > > > arc4random_uniform_fast2 that I made, streams in data from arc4random() > and > > uses the datastream directly and uses it as a bit by bit right "sliding > > window" in the last loop. arc4random_uniform() uses a modulus which I is > > simple to implement, but I wonder how cryptographically sound or even how > > evenly it distributes. Adding a modulus seems sloppy without something > > better. I did make arc4random_fast_simple() which merely takes an > > upperbound. 
> > I integrated arc4random_uniform_fast_bitsearch() or whatever > > the top function was into it which binary searches to find the correct size > > bitfield (return value) needed to barely fit the upperbound while also > > being able to discover every possible value below the upperbound. It isn't > > as fast as arc4random_uniform_fast2 if it were used repeatedly after a > > single use of arc4random_uniform_fast_bitsearch() , but it does exactly the > > same thing and appears faster than repeatedly using arc4random_uniform() > > and it's wasteful use of arc4random() and calling the expensive rekeying > > function more often. > > > > It may be interesting to determine even without looking at performance, > > whether arc4random_fast_simple() creates a more superior, secure use of the > > chacha20 stream than arc4random_uniform() with the modulus. what exactly > > does all that extra data from the modulus do to the random distribution? > > > > -Luke > > I don't follow you at all. Your blabbering does not even use the terms > "uniform" and "modulo bias". I wonder even if you realize what they > mean in this context. > > -Otto > > -- -Luke
Re: Picky, but much more efficient arc4random_uniform!
On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote: > arc4random_uniform_fast2 that I made, streams in data from arc4random() and > uses the datastream directly and uses it as a bit by bit right "sliding > window" in the last loop. arc4random_uniform() uses a modulus which I is > simple to implement, but I wonder how cryptographically sound or even how > evenly it distributes. Adding a modulus seems sloppy without something > better. I did make arc4random_fast_simple() which merely takes an > upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever > the top function was into it which binary searches to find the correct size > bitfield (return value) needed to barely fit the upperbound while also > being able to discover every possible value below the upperbound. It isn't > as fast as arc4random_uniform_fast2 if it were used repeatedly after a > single use of arc4random_uniform_fast_bitsearch() , but it does exactly the > same thing and appears faster than repeatedly using arc4random_uniform() > and it's wasteful use of arc4random() and calling the expensive rekeying > function more often. > > It may be interesting to determine even without looking at performance, > whether arc4random_fast_simple() creates a more superior, secure use of the > chacha20 stream than arc4random_uniform() with the modulus. what exactly > does all that extra data from the modulus do to the random distribution? > > -Luke I don't follow you at all. Your blabbering does not even use the terms "uniform" and "modulo bias". I wonder even if you realize what they mean in this context. -Otto
Re: Picky, but much more efficient arc4random_uniform!
arc4random_uniform_fast2 that I made streams in data from arc4random() and uses the datastream directly, as a bit-by-bit right "sliding window" in the last loop. arc4random_uniform() uses a modulus, which is simple to implement, but I wonder how cryptographically sound it is or how evenly it distributes. Adding a modulus seems sloppy without something better. I did make arc4random_fast_simple() which merely takes an upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever the top function was into it, which binary searches to find the correct size bitfield (return value) needed to barely fit the upperbound while also being able to discover every possible value below the upperbound. It isn't as fast as arc4random_uniform_fast2 if it were used repeatedly after a single use of arc4random_uniform_fast_bitsearch(), but it does exactly the same thing and appears faster than repeatedly using arc4random_uniform(), with its wasteful use of arc4random() and more frequent calls to the expensive rekeying function. It may be interesting to determine, even without looking at performance, whether arc4random_fast_simple() makes a more secure use of the chacha20 stream than arc4random_uniform() with the modulus. What exactly does all that extra data from the modulus do to the random distribution? -Luke
Re: Picky, but much more efficient arc4random_uniform!
In general, wasting some arc4random bits is not a problem at all. I see a lot of code with no demonstration, explanation or proof why it would be correct (i.e. produces uniform results). You only talk about speed. The current code might not be optimal, but at least it is very clear it's correct. -Otto On Fri, May 13, 2022 at 09:43:26AM -0500, Luke Small wrote: > I made a couple new versions of a new kind of arc4random_uniform-like > function and some example functions which use them. Instead of having a > sufficiently large random number greater than the modulus, I pick a random > number using arc4random() from a bitfield where the length of the bitfield > is just below or slightly beyond the value of the modulus and returns the > bitfield it if it is less than the value of the modulus. > > In both versions, I retrieve a bitfield from a static uint64_t which is > refreshed from periodic arc4random() calls. The functions demand a bit > length. but I provide a convenient bitsize search function: > arc4random_uniform_fast_bitsearch() > > in the first version, if the chosen return value isn't less than the > modulus, the entire bitfield is trashed and a completely new bitfield is > refreshed from the cache. This can be very inefficient and for certain > upperbounds where the functions demand that all returned values less than > the upperbound are possible. This can perform worse than > arc4random_uniform() even with more random number generation churn. > > in the second version, if the chosen return value isn't less than the > modulus, the bitfield is shifted right, losing the least significant bit > and a new random-value bit from the least significant bit of the cache is > put into the most significant bit of the bitfield and it tries again. For > significant expenditures of effort, this version is always more efficient > than arc4random_uniform() and slightly to much more efficient than my first > version. 
> > > The performance boost comes from less pseudorandom number generation by not > trashing perfectly good random data and preserving it so that rekeying is > performed less. > > > I suspect that the first version may be secure, but I'm now sure what you > think of the second version, for being secure. Is there a way to test how > well distributed and secure it is?! > > Perhaps it's even better than the typical use case of arc4random_uniform > since it feeds it a bitstream instead of performing a modulous operation on > it! > > I pledge("stdio") and unveil(NULL, NULL) at the beginning, so you know it > doesn't do anything but malloc() an array, do stuff on the array and print > stuff; if you don't profile, anyway. > > What do you think of having such a thing in OpenBSD? > > > > > newdata() generates random a-z characters performing arc4random_uniform(26); > > newdataTypableFilename() generates random filename typable characters > performing arc4random_uniform(92); > > I perform other tests including presumed worst-cases on large numbers and > numbers which will have a lot of misses. > > > > > -Luke
Picky, but much more efficient arc4random_uniform!
I made a couple new versions of a new kind of arc4random_uniform-like function and some example functions which use them. Instead of drawing a sufficiently large random number and reducing it by the modulus, I pick a random number using arc4random() from a bitfield whose length just covers the value of the modulus, and return the bitfield if it is less than the value of the modulus.

In both versions, I retrieve a bitfield from a static uint64_t which is refreshed from periodic arc4random() calls. The functions demand a bit length, but I provide a convenient bitsize search function: arc4random_uniform_fast_bitsearch()

In the first version, if the chosen return value isn't less than the modulus, the entire bitfield is trashed and a completely new bitfield is refreshed from the cache. This can be very inefficient for certain upperbounds, since the functions demand that every value less than the upperbound remains possible; it can perform worse than arc4random_uniform() even with more random number generation churn.

In the second version, if the chosen return value isn't less than the modulus, the bitfield is shifted right, losing the least significant bit, and a new random-value bit from the least significant bit of the cache is put into the most significant bit of the bitfield, and it tries again. For significant expenditures of effort, this version is always more efficient than arc4random_uniform() and slightly to much more efficient than my first version.

The performance boost comes from less pseudorandom number generation by not trashing perfectly good random data and preserving it so that rekeying is performed less.

I suspect that the first version may be secure, but I'm not sure what you think of the second version, for being secure. Is there a way to test how well distributed and secure it is?! 
Perhaps it's even better than the typical use case of arc4random_uniform since it feeds it a bitstream instead of performing a modulus operation on it!

I pledge("stdio") and unveil(NULL, NULL) at the beginning, so you know it doesn't do anything but malloc() an array, do stuff on the array and print stuff; if you don't profile, anyway.

What do you think of having such a thing in OpenBSD?

newdata() generates random a-z characters performing arc4random_uniform(26); newdataTypableFilename() generates random filename-typable characters performing arc4random_uniform(92). I perform other tests, including presumed worst cases on large numbers and numbers which will have a lot of misses.

-Luke

[attachment: arc4random_uniform_fast.c]