Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Theo de Raadt
We will not be using your code.

End of story.


Luke Small  wrote:

> Marc, all you all have to do is say that you all refuse to provide it.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Luke Small
Marc, all you all have to do is say that you all refuse to provide it.

I was asked to at least provide evidence for correctness. I did so, and I'd
say I did a stellar job, short of running the output through some kind of
statistical analysis program.

The following has an attached source code for my test (along with
referencing another post) in case you missed it in this ginormous thread:

https://marc.info/?l=openbsd-tech&m=165306234108572&w=2


Multithreaded processes aside, if you provide this, folks might use it
instead of creating their own incorrect custom random number generators to
improve performance for smaller upper bounds.

That custom deterministic “arc4random()” earlier in the thread from Damien
was not very good by the way. I tested it.


> You're about one step away from getting into my spam rules.
>
> Whatever you say, the way you say it, *repeatedly* when all the people
> I know (me included) tell you to get your facts straight so you don't
> sound like a wacko, makes reading you a total waste of time.
>
-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Marc Espie
On Sat, May 21, 2022 at 07:47:40AM -0500, Luke Small wrote:
> Perhaps it was rude sending off list stuff to the list. Your email sounded
> "less than friendly" and more of a professional challenge that you were
> definitely in the works to produce; much like Damien Miller’s challenge to
> prove correctness. So, whatever.

You're about one step away from getting into my spam rules.

Whatever you say, the way you say it, *repeatedly* when all the people
I know (me included) tell you to get your facts straight so you don't
sound like a wacko, makes reading you a total waste of time.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Luke Small
Perhaps it was rude sending off list stuff to the list. Your email sounded
"less than friendly" and more of a professional challenge that you were
definitely in the works to produce; much like Damien Miller’s challenge to
prove correctness. So, whatever.

Aside from that unpleasantness:
I worked in the __builtin_clz(upperbound - 1) approach mentioned earlier in
the thread instead of my binary search. It made the Knuth sort simulation
run even faster: arc4random_uniform now takes 130% of the time mine takes
for a series of random numbers decreasing from 65535 to 2.

“__builtin_clz((upperbound - 1) | 1)” is only needed if upperbound can be 1
and that possibility is eliminated by first checking to see that upperbound
is >= 2.

static uint32_t
arc4random_uniform_small_unlocked(uint32_t upper_bound)
{
	static uint64_t rand_holder = 0;
	static size_t rand_bits = 0;

	static size_t upper_bound0 = 2;
	static size_t bitmask = 0x01;
	static size_t bits = 1;

	const size_t ub = upper_bound;
	size_t ret;

	if (ub != upper_bound0) {

		if (ub < 2) {

			/*
			 * reset static cache if a process needs to fork(),
			 * to make it manually fork-safe
			 */
			if (ub == 0) {
				rand_holder = 0;
				rand_bits = 0;
			}
			return 0;
		}

		bits = 32 - __builtin_clz(ub - 1);
		bitmask = ((size_t)1 << bits) - 1;

		upper_bound0 = ub;
	}

	do {
		if (rand_bits < bits) {
			rand_holder |= ((uint64_t)arc4random()) << rand_bits;

			/*
			 * rand_bits will be between 0 and 31 here
			 * so the 0x20 bit will be empty
			 * rand_bits += 32;
			 */
			rand_bits |= 32;
		}

		ret = rand_holder & bitmask;
		rand_holder >>= bits;
		rand_bits -= bits;

	} while (ret >= ub);

	return (uint32_t)ret;
}


> Luke,
>
> It's very bad etiquette to deliberately re-post a private, off-list comment
> to a public mailing list.
>

> Also, please fix your email client to respect the Mail-Followup-To: header,
> this is another lack of etiquette on your part.
>

I am either using gmail app on a phone or gmail.com, so I don't know if I
can help you there.


Re: Picky, but much more efficient arc4random_uniform!

2022-05-21 Thread Crystal Kolipe
On Fri, May 20, 2022 at 09:48:28PM -0500, Luke Small wrote:
> Crystal: You can prove that for random, repetitive, correct, database
> record name generation using small upperbounds, the demonstrated 1/3-1/2
> runtime isn't worth it for an upperbound like 26 - 92 in a business context
> that fights for every last millisecond?
> 
> Bring it.
> 
> Prove the correctness of whatever algorithm you're using while you're at it.

Luke,

It's very bad etiquette to deliberately re-post a private, off-list comment
to a public mailing list.

Just remember that you might need my help one day, and if your mail has been
sent to /dev/null or otherwise ignored then I will never see it.

It's also very rude to make demands when I've already told you that I have
other work of a much higher priority.

IF AND ONLY IF there is sufficient interest from OTHER people, (who can
indicate it to me OFF LIST), then I will consider writing up a comprehensive
article debunking your ideas and concepts.

_Quality_ work takes time, and I've only just finished writing up our
response to the claim that our asm optimisation in the rasops bit shifting
code wasn't beneficial over the C compiler output, (which of course it was).

Also, please fix your email client to respect the Mail-Followup-To: header,
this is another lack of etiquette on your part.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-20 Thread Luke Small
Crystal: You can prove that for random, repetitive, correct, database
record name generation using small upperbounds, the demonstrated 1/3-1/2
runtime isn’t worth it for an upperbound like 26 - 92 in a business context
that fights for every last millisecond?

Bring it.

Prove the correctness of whatever algorithm you’re using while you’re at it.

…unless a lot of those processes are threaded of course. :/

On Fri, May 20, 2022 at 7:27 PM Crystal Kolipe 
wrote:

>
> I've actually written a program to demonstrate how pointless your chacha
> bit
> saving routine is :).  I just haven't posted it yet because I'm too busy
> with
> other, (more useful), things to bother finishing it off.
>
> Your thread on -tech has already wasted more bits than your random number
> routine would save in a very long time, ha ha ha.
>
-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-20 Thread Luke Small
I appreciate your response, Damien.

I did do the bare minimum of correctness testing and it was the post right
after Guenther was congratulated on proving incorrectness:

https://marc.info/?l=openbsd-tech&m=165259528425835&w=2

The post includes software to reproduce the results.



I wrote a highly dynamically allocated program that tests at intervals of
intervals, to show the degree to which the output remains random at various
stages.

This is an example of some output:

testing arc4random_uniform(5) and arc4random_uniform_small_unlocked(5):

              256X     2048X    16384X   131072X  1048576X

  0 original   246      2053     16400    131115   1048306
    mine       263      2042     16312    131258   1046989

  1 original   234      2013     16337    130894   1047810
    mine       249      2022     16304    131161   1049511

  2 original   236      2061     16367    130802   1047430
    mine       257      2117     16597    130718   1046375

  3 original   284      2089     16444    131092   1050332
    mine       266      2058     16379    131190   1049877

  4 original   280      2024     16372    131457   1049002
    mine       245      2001     16328    131033   1050128


max-min values:

    original    50        76       107       655      2902
    mine        21       116       293       540      3753


The program is set up to test values 2 through 50. This will create 224KiB
of output.
I suspected that you'd prefer that I didn't attach it.

Some progress output has been relegated to stderr so that you can easily
pipe it to a file.

I added some getrlimit and setrlimit calls to maximize memory limits if
necessary. In the case that I created for you, this will not be needed.

It uses 1276K RAM in the current configuration.


> Just to bring this back to where we came in: even putting thread-safety
> aside (which you absolutely can't): it doesn't matter how much faster
> it is, your replacement function isn't useful until you do the work to
> demonstrate correctness.
>
> You have done literally zero so far, not even the bare minimum of
> testing the output. As a result your first version was demonstrated to
> be completely broken by literally the most basic of possible tests, a
> mere 10 lines of code.
>
> That you left this to others to do tells me that you fundamentally don't
> understand the environment in which you're trying to operate, and that
> you don't respect the time of the people you're trying to convince.
>
> Please stop wasting our time.
>
> -d

-- 
-Luke
#include 
#include 
#include 
#include 
#include 
#include 
#include 

extern char *malloc_options;


/*
cc arc4random_uniform_small.c -O2 -march=native -mtune=native -flto=full \
-static -p -fno-inline && ./a.out && gprof a.out gmon.out

cc arc4random_uniform_small.c -O2 && ./a.out
 */


/* 
 * uint32_t
 * arc4random_uniform(uint32_t upper_bound);
 * 
 * for which this function is named, provides a cryptographically secure:
 * uint32_t arc4random() % upper_bound;
 * 
 * this function minimizes the waste of randomly generated bits,
 * for small upper bounds.
 * 
 * 'bits' is the requested number of random bits and it needs to be
 * within the following parameters:
 * 
 *   1 << bits >= upper_bound > (1 << bits) >> 1
 * 
 * I make a possibly incorrect presumption that size_t is
 * at the very least a uint32_t, but probably a uint64_t for speed
 */
static uint32_t
arc4random_uniform_small_unlocked(uint32_t upper_bound)
{	
	static uint64_t rand_holder = 0;
	static uint64_t rand_bits = 0;
	
	static uint64_t upper_bound0 = 2;
	static uint64_t bits0 = 1;

	uint64_t bits = 16, i = 1, n = 32, ret, ub = upper_bound;

	if (ub != upper_bound0) {
		
		if (ub < 2) {
			
			/*
			 * reset static cache if a process needs to fork(),
			 * to make it manually fork-safe
			 */
			if (ub == 0) {
				rand_holder = 0;
				rand_bits = 0;
			}
			return 0;
		}
		
		/*
		 * binary search for ideal size random bitstream selection
		 */
		for (;;) {

			if ( ub > ((uint64_t)1 << bits) ) {

				if (n - i == 1) {
					bits = n;
					break;
				}

				i = bits;
				bits = (i + n) / 2;
				continue;
			}
			if ( ub <= ((uint64_t)1 << bits) >> 1 ) {

				if (n - i == 1) {
					bits = n;
					break;
				}

				n = bits;
				bits = (i + n) / 2;
				continue;
			}

			break;
		}
		upper_bound0 = ub;
		bits0 = bits;
	} else
		bits = bits0;
		
	const uint64_t bitmask = ((uint64_t)1 << bits) - 1;
		
	do {
		if (rand_bits < bits) {
			rand_holder |= ((uint64_t)arc4random()) << rand_bits;
			
			/* 
			 * rand_bits will be a number between 0 and 31 here
			 * so the 0x20 bit will be empty
			 * rand_bits += 32;
			 */ 
			rand_bits |= 32;
		}
		
		


Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Steffen Nurpmeso
Joerg Sonnenberger wrote in
 :
 |Am Wed, May 18, 2022 at 12:49:21PM +0200 schrieb Steffen Nurpmeso:
 |> What surprised me was that the Apple code requires more calls, and
 |> that today divisions and multiplications still matter.  I think it
 |> was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
 |> one cycle, * was ten cycles, and %,/ was forty cycles.  But
 |> i think the Core i5 8th Gen that i have requires one cycle for all
 |> of them.
 |
 |I dare you to find any CPU optimized for speed where division and modulo
 |are as fast as multiplication. What you have nowadays is a single cycle
 |for basic operations like additions and shifts, about three cycles for
 |multiplications (on x86 at least) and an order of magnitude more for
 |division. That's not really surprising either if you consider that

Thanks.  I really had not looked since about 2005.  (In fact
i never really looked, i think a c't magazine CD once came over
with x86 interrupt lists, CPU cycle usage, and a (sketchy) RFC
collection.)  Hm, it even seems the number of cycles increased for
division .. i do not know which numbers i had seen by then, did
they include cache work at all?  Well.  So this was that.

 |integer multiplication has a fast circuit implementation whereas
 |division is often implemented as cascading conditional subtraction.
 |The difference matters enough that turning the remainder operation in the ELF
 |hash into essentially (x - x * (1/size) * size) gave a 2% speed up for a
 |large application like Firefox ten years ago.

Interesting.  Well you live in pkgsrc with all the mess that all
those huge software suites introduce, whereas i still "jerk" if
i produce software which needs more than EXEC (/ RODATA) and BSS,
and "shudder" for all the CTOR and DTOR and TLS (but ok: cool) and
all the other stuff that is found in ELF, and handled by the
linker.  (I have ELF v1.2 and TLS-ABI, and even glanced over them
a really long time ago though.  But you know, it is so far off!
And it makes no fun thinking about that :))

 |Joerg
 --End of 

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Joerg Sonnenberger
Am Wed, May 18, 2022 at 12:49:21PM +0200 schrieb Steffen Nurpmeso:
> What surprised me was that the Apple code requires more calls, and
> that today divisions and multiplications still matter.  I think it
> was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
> one cycle, * was ten cycles, and %,/ was forty cycles.  But
> i think the Core i5 8th Gen that i have requires one cycle for all
> of them.

I dare you to find any CPU optimized for speed where division and modulo
are as fast as multiplication. What you have nowadays is a single cycle
for basic operations like additions and shifts, about three cycles for
multiplications (on x86 at least) and an order of magnitude more for
division. That's not really surprising either if you consider that
integer multiplication has a fast circuit implementation whereas
division is often implemented as cascading conditional subtraction.
The difference matters enough that turning the remainder operation in the ELF
hash into essentially (x - x * (1/size) * size) gave a 2% speed up for a
large application like Firefox ten years ago.

Joerg



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in
 <20220518104921.sr0ht%stef...@sdaoden.eu>:
 ...
 |The web site that has been linked from the man from the country
 |that has an even much worse Earth Country Overshoot Day than
 |Germany and is almost en par with Australia or even USA (April
 |3rd, pooh; never again a Saab!  Cuba: Nov 25th, Jamaica Dec 20th)

Finland is March 31st.  Man is that dirty.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Steffen Nurpmeso
Philip Guenther wrote in
 :
 |On Tue, May 17, 2022 at 1:10 PM Steffen Nurpmeso  \
 |wrote:
 |> Joerg Sonnenberger wrote in
 |>  :
 |>|Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
 |>|> I made a couple new versions of a new kind of arc4random_uniform-like
 |>  ...
 |>|If your main use case is limiting the amount of cryptography when using
 |>|small bounds, there is a much simpler approach to be taken here. For
 |>|boundaries below 256, use arc4random_buf to extract one byte if bound is
 |>|a power of two, otherwise two. This gives most of the performance
 ...
 |> You can use (really implemented) _buf() if you need a 8-bit or
 |> 16-bit etc number.
 |>
 |> I find that _uniform() often makes no difference to a simple
 |> modulo because like the comment in _uniform() says "p > 0.5 (worst
 |>case, usually far better", and usually RNGs sprinkle bits nicely,
 |
 |What does that statement mean?  You seem to be saying "modulo is uniform,
 |except when it isn't, which could be almost half the time for some cases,
 |but when it's uniform it's uniform, so why bother making it actually
 |correct and dependable".
 |
 |I mean, what does that _mean_???   It's as if I said "my text handling

Well it means this thread was too long.

 |program handles all characters uniformly, except those with accents, but
 |that's less than 10% of the characters I type, so it handles all characters
 |uniformly."  WTF, NO!

But it also means that

// calculates 2**32 % range
uint32_t t = (-range) % range;
for (;;) {
uint32_t r = rng();
if (r >= t)

where range is a very small number results in a very, very low
probability that r>=t is not true.  For 16-bit 0x two zero
bytes had to occur in the upper 16-bit.  And worse for 64-bit RNG.
So this is what i said.  It depends on the application.

This gets hypothetic and is anyway beyond my mathematical
capabilities.  I mean i look at /etc/ssh/moduli, so much on
cramping of random numbers.
The web site that has been linked from the man from the country
that has an even much worse Earth Country Overshoot Day than
Germany and is almost en par with Australia or even USA (April
3rd, pooh; never again a Saab!  Cuba: Nov 25th, Jamaica Dec 20th)
said the bias for the number 52 is 0.0121%.  And what i posted
had ~0.008 that rand<0x01FF aka 32-bit high byte is 0, for
32-bit from getrandom(2) as well as mine (in "strong" mode,
SipHash on ARC4).

You need quite some samples to be able to find a statistical
frequency of that order.  And it depends on the application of
that many samples exist.
And even TCP from RFC 793 / September 1981 has a 32-bit sequence
number.

But sure Philip, of course, yes, of course you are right: simply
call _uniform() and "milk the shit out of the random range" --
just use it and forget about the problem.

What surprised me was that the Apple code requires more calls, and
that today divisions and multiplications still matter.  I think it
was the Cyrix 166+ (or was it Athlon 1600+) where +,-,<<,>> was
one cycle, * was ten cycles, and %,/ was forty cycles.  But
i think the Core i5 8th Gen that i have requires one cycle for all
of them.  (And somewhere i read that there are CPUs where the
shift operators are now more expensive than multiplication.)
I do not really track any of that since 2005.  That is nice: you
buy a laptop and suddenly have a NVME SSD that can 1.5GB/sec.
Wow.

 |> 0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FF in
 |> most cases for "my RNG" (of 10 803/759/793 NOT; 776/805/793
 |> NOT for Linux getrandom(2)), which is a pretty high cut off.
 ...
 |Where do these ideas come from, that "0 bytes 'do not occur'"??  If your
 |rand generator doesn't provide zero bytes at the expected frequency, you
 |know, 1 in 256, then you're using a garbage random number generator.
 |Please stop making such suggestions here because THEY ARE NOT TRUE ABOUT
 |OPENBSD.  Do ya'll not bother to test the claims that you make?
 ...
 |for (;;) {
 |u = arc4random();
 |if ((u & 0xff000000) == 0 ||
 |(u & 0x00ff0000) == 0 ||
 |(u & 0x0000ff00) == 0 ||
 |(u & 0x000000ff) == 0)
 |break;

That is any-byte-0.

 |00b82e5c58
 |ab47880036

Seems random to me.
:)

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Otto Moerbeek
On Wed, May 18, 2022 at 05:08:02PM +1000, Damien Miller wrote:

> On Wed, 18 May 2022, Otto Moerbeek wrote:
> 
> > instrumenting the code to count the number of arc4random calls I see this:
> > 
> > openbsd; elapsed = 2.835819; calls = 12340949
> > bitmask; elapsed = 4.335576; calls = 17836216
> > bitmask+reuse; elapsed = 3.710277; calls = 15245337
> > 
> > (this is a different number of trials on a slow machine).
> > 
> > So the promise of fewer calls is not fulfilled. Sounds like a bug.
> 
> Well, I don't know whether the simple bitmasking approach will really
> result in fewer calls. I *think* our current approach has a higher
> probability per loop of succeeding (at least for small upper_bounds) than
> mask-and-test, but my brain is too addled with a cold to calculate it :/

I *assumed* the speedups seen in the web page were because of fewer rnd
calls.  That was a mistake.

> What Theo said is 100% right - the cost is dominated by that of the
> underlying RNG. If anyone wants a faster arc4random_uniform() then the
> first place to look is at arc4random().

One more reason to keep the current implementation of arc4random_uniform().

> 
> It's entirely possible that the speedups measured in that article
> are because they are using an omgoptimised (non-crypto) RNG and the
> improvements they saw were due solely to reducing the overhead inside
> the uniformity code, even if it came at the cost of more RNG queries.

Right.

> Anyway, here's a tweaked up version of the test harness that fakes out
> arc4random() with a deterministic RNG that counts how often it is called
> in case anyone wants to play with it further.

-Otto


> 
> 
> 
> 
> #include 
> #include 
> 
> #include 
> #include 
> 
> static uint32_t nqueries;
> 
> /* deterministic, query-counting arc4random() replacement for benchmarking */
> static uint32_t
> fake_random(void)
> {
>   static uint32_t ready, remain, msb;
>   uint32_t ret;
> 
>   if (!ready) {
>   srand_deterministic(31337);
>   ready = 1;
>   }
>   if (remain == 0) {
>   msb = (uint32_t)rand();
>   remain = 31; /* XXX makes assumption re RAND_MAX */
>   }
>   ret = (uint32_t)rand() | (msb << 31);
>   msb >>= 1;
>   remain--;
>   nqueries++;
>   return ret; 
> }
> 
> #define arc4random() fake_random()
> 
> #ifndef __has_builtin
> #define __has_builtin(x) 0
> #endif
> #if __has_builtin(__builtin_clz)
> #define arc4random_clz(x) __builtin_clz(x)
> #else
> #warning lacks __builtin_clz()
> /* Count most-significant zero bits, like __builtin_clz() */
> static int
> arc4random_clz(unsigned int x)
> {
>   int ret = 0;
>   unsigned int n;
>   const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
> 
>   while (x != 0) {
>   n = (x >> 28) & 0xf;
>   if (n != 0)
>   return ret + lut[n];
>   x <<= 4;
>   ret += 4;
>   }
>   return 32;
> }
> #endif
> 
> /* Current OpenBSD implementation */
> static uint32_t
> arc4random_uniform1(uint32_t upper_bound)
> {
>   uint32_t r, min;
> 
>   if (upper_bound < 2)
>   return 0;
> 
>   /* 2**32 % x == (2**32 - x) % x */
>   min = -upper_bound % upper_bound;
> 
>   /*
>* This could theoretically loop forever but each retry has
>* p > 0.5 (worst case, usually far better) of selecting a
>* number inside the range we need, so it should rarely need
>* to re-roll.
>*/
>   for (;;) {
>   r = arc4random();
>   if (r >= min)
>   break;
>   }
> 
>   return r % upper_bound;
> }
> 
> /*
>  * Like "Bitmask with Rejection" implementation from
>  * https://www.pcg-random.org/posts/bounded-rands.html
>  */
> static uint32_t
> arc4random_uniform2(uint32_t upper_bound)
> {
>   uint32_t mask, r;
> 
>   if (upper_bound < 2)
>   return 0;
> 
>   mask = 0xffffffff >> arc4random_clz((upper_bound - 1) | 1);
>   for (;;) {
>   r = arc4random() & mask;
>   if (r < upper_bound)
>   return r;
>   }
>   /* NOTREACHED */
> }
> 
> /*
>  * Like Apple's
>  * 
> https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
>  */
> static uint32_t
> arc4random_uniform3(uint32_t upper_bound)
> {
>   int zeroes, bits, remain;
>   uint32_t mask, r, rm;
> 
>   if (upper_bound < 2)
>   return 0;
> 
>   zeroes = arc4random_clz((upper_bound - 1) | 1);
>   bits = 32 - zeroes;
>   mask = (uint32_t)-1 >> zeroes;
> 
>   for (;;) {
>   r = arc4random();
>   rm = r & mask;
>   if (rm < upper_bound)
>   return rm;
>   for (remain = zeroes; remain >= bits; remain -= bits) {
>   r >>= bits;
>   rm = r & mask;
>   

Re: Picky, but much more efficient arc4random_uniform!

2022-05-18 Thread Damien Miller
On Wed, 18 May 2022, Otto Moerbeek wrote:

> instrumenting the code to count the number of arc4random calls I see this:
> 
> openbsd; elapsed = 2.835819; calls = 12340949
> bitmask; elapsed = 4.335576; calls = 17836216
> bitmask+reuse; elapsed = 3.710277; calls = 15245337
> 
> (this is a different number of trials on a slow machine).
> 
> So the promise of fewer calls is not fulfilled. Sounds like a bug.

Well, I don't know whether the simple bitmasking approach will really
result in fewer calls. I *think* our current approach has a higher
probability per loop of succeeding (at least for small upper_bounds) than
mask-and-test, but my brain is too addled with a cold to calculate it :/

What Theo said is 100% right - the cost is dominated by that of the
underlying RNG. If anyone wants a faster arc4random_uniform() then the
first place to look is at arc4random().

It's entirely possible that the speedups measured in that article
are because they are using an omgoptimised (non-crypto) RNG and the
improvements they saw were due solely to reducing the overhead inside
the uniformity code, even if it came at the cost of more RNG queries.

Anyway, here's a tweaked up version of the test harness that fakes out
arc4random() with a deterministic RNG that counts how often it is called
in case anyone wants to play with it further.




#include 
#include 

#include 
#include 

static uint32_t nqueries;

/* deterministic, query-counting arc4random() replacement for benchmarking */
static uint32_t
fake_random(void)
{
static uint32_t ready, remain, msb;
uint32_t ret;

if (!ready) {
srand_deterministic(31337);
ready = 1;
}
if (remain == 0) {
msb = (uint32_t)rand();
remain = 31; /* XXX makes assumption re RAND_MAX */
}
ret = (uint32_t)rand() | (msb << 31);
msb >>= 1;
remain--;
nqueries++;
return ret; 
}

#define arc4random() fake_random()

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif
#if __has_builtin(__builtin_clz)
#define arc4random_clz(x) __builtin_clz(x)
#else
#warning lacks __builtin_clz()
/* Count most-significant zero bits, like __builtin_clz() */
static int
arc4random_clz(unsigned int x)
{
int ret = 0;
unsigned int n;
const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

while (x != 0) {
n = (x >> 28) & 0xf;
if (n != 0)
return ret + lut[n];
x <<= 4;
ret += 4;
}
return 32;
}
#endif

/* Current OpenBSD implementation */
static uint32_t
arc4random_uniform1(uint32_t upper_bound)
{
uint32_t r, min;

if (upper_bound < 2)
return 0;

/* 2**32 % x == (2**32 - x) % x */
min = -upper_bound % upper_bound;

/*
 * This could theoretically loop forever but each retry has
 * p > 0.5 (worst case, usually far better) of selecting a
 * number inside the range we need, so it should rarely need
 * to re-roll.
 */
for (;;) {
r = arc4random();
if (r >= min)
break;
}

return r % upper_bound;
}

/*
 * Like "Bitmask with Rejection" implementation from
 * https://www.pcg-random.org/posts/bounded-rands.html
 */
static uint32_t
arc4random_uniform2(uint32_t upper_bound)
{
uint32_t mask, r;

if (upper_bound < 2)
return 0;

mask = 0xffffffff >> arc4random_clz((upper_bound - 1) | 1);
for (;;) {
r = arc4random() & mask;
if (r < upper_bound)
return r;
}
/* NOTREACHED */
}

/*
 * Like Apple's
 * 
https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
 */
static uint32_t
arc4random_uniform3(uint32_t upper_bound)
{
int zeroes, bits, remain;
uint32_t mask, r, rm;

if (upper_bound < 2)
return 0;

zeroes = arc4random_clz((upper_bound - 1) | 1);
bits = 32 - zeroes;
mask = (uint32_t)-1 >> zeroes;

for (;;) {
r = arc4random();
rm = r & mask;
if (rm < upper_bound)
return rm;
for (remain = zeroes; remain >= bits; remain -= bits) {
r >>= bits;
rm = r & mask;
if (rm < upper_bound)
return rm;
}
}
/* NOTREACHED */
}

#include 

static uint32_t
fixture(const char *s, uint32_t (* const rnd)(uint32_t))
{
const uint32_t trials = 20 * 1000 * 1000;
const uint32_t max = 0x8000;
const uint32_t mul = 7;
unsigned int v, i, r;
struct timeval start, finish, delta;

nqueries = 0;
gettimeofday(, 

Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Otto Moerbeek
On Wed, May 18, 2022 at 02:50:28PM +1000, Damien Miller wrote:

> On Tue, 17 May 2022, Raimo Niskanen wrote:
> 
> > Why reinvent the wheel?
> > 
> > Here is a pretty good walkthrough of established methods:
> > 
> > https://www.pcg-random.org/posts/bounded-rands.html
> > 
> > It sounds to me as if your suggested method essentially is
> > "Bitmask with Rejection -- Apple's Method" with the added twist
> > to keep the unused bits of the base generator's word and append
> > new, which might be an optimization for small ranges, but might
> > be just overhead for large ones.
> > 
> > In that post one can see that there might be other suggested smaller
> > optimizations that can be applied to the OpenBSD method, and that
> > the bitmask method is effective in many cases but not a clear winner.
> 
> Oh nice, I wasn't aware that Apple had an improved algorithm. I always
> thought the best avenue for a speedup was to consider only the bits that
> could satisfy the constraint, but it never occurred to me how to actually
> make use of this :)
> 
> The toy benchmark below compares the existing implementation with
> reimplementations of both their version as well as something more close
> to Apple's actual method (which reuses the high bits).
> 
> However, I don't see a speedup for either of the alternate approaches.
> 
> [djm@djm ~]$ cc -O2 -Werror -Wall -Wextra -o rb rb.c && doas nice -n -20 ./rb 
> openbsd; elapsed = 8.327954
> bitmask; elapsed = 13.306228
> bitmask+reuse; elapsed = 11.567389

instrumenting the code to count the number of arc4random calls I see thsi:

openbsd; elapsed = 2.835819; calls = 12340949
bitmask; elapsed = 4.335576; calls = 17836216
bitmask+reuse; elapsed = 3.710277; calls = 15245337

(this is a different number of trials on a slow machine).

So the promise of less calls is not fulfilled. Sounds like a bug.

-Otto



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Theo de Raadt
> However, I don't see a speedup for either of the alternate approaches.

Or maybe, just maybe, the underlying function (arc4random, which in the
dominant case is just a small chacha step) is so inexpensive relative
to the proposed higher-level logic changes?  So this is all tilting at
windmills, and also something about measuring before optimizing^Wbreaking...



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Damien Miller
On Tue, 17 May 2022, Raimo Niskanen wrote:

> Why reinvent the wheel?
> 
> Here is a pretty good walkthrough of established methods:
> 
> https://www.pcg-random.org/posts/bounded-rands.html
> 
> It sounds to me as if your suggested method essentially is
> "Bitmask with Rejection -- Apple's Method" with the added twist
> to keep the unused bits of the base generator's word and append
> new, which might be an optimization for small ranges, but might
> be just overhead for large ones.
> 
> In that post one can see that there might be other suggested smaller
> optimizations that can be applied to the OpenBSD method, and that
> the bitmask method is effective in many cases but not a clear winner.

Oh nice, I wasn't aware that Apple had an improved algorithm. I always
thought the best avenue for a speedup was to consider only the bits that
could satisfy the constraint, but it never occurred to me how to actually
make use of this :)

The toy benchmark below compares the existing implementation with
reimplementations of both their version as well as something more close
to Apple's actual method (which reuses the high bits).

However, I don't see a speedup for either of the alternate approaches.

[djm@djm ~]$ cc -O2 -Werror -Wall -Wextra -o rb rb.c && doas nice -n -20 ./rb 
openbsd; elapsed = 8.327954
bitmask; elapsed = 13.306228
bitmask+reuse; elapsed = 11.567389

Maybe my benchmark is crap or maybe I need dlg@ to omgoptimize it for me...



#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif
#if __has_builtin(__builtin_clz)
#define arc4random_clz(x) __builtin_clz(x)
#else
#warning lacks __builtin_clz()
/* Count most-significant zero bits, like __builtin_clz() */
static int
arc4random_clz(unsigned int x)
{
	int ret = 0;
	unsigned int n;
	const int lut[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };

	while (x != 0) {
		n = (x >> 28) & 0xf;
		if (n != 0)
			return ret + lut[n];
		x <<= 4;
		ret += 4;
	}
	return 32;
}
#endif

/* Current OpenBSD implementation */
static uint32_t
arc4random_uniform1(uint32_t upper_bound)
{
	uint32_t r, min;

	if (upper_bound < 2)
		return 0;

	/* 2**32 % x == (2**32 - x) % x */
	min = -upper_bound % upper_bound;

	/*
	 * This could theoretically loop forever but each retry has
	 * p > 0.5 (worst case, usually far better) of selecting a
	 * number inside the range we need, so it should rarely need
	 * to re-roll.
	 */
	for (;;) {
		r = arc4random();
		if (r >= min)
			break;
	}

	return r % upper_bound;
}

/*
 * Like "Bitmask with Rejection" implementation from
 * https://www.pcg-random.org/posts/bounded-rands.html
 */
static uint32_t
arc4random_uniform2(uint32_t upper_bound)
{
	uint32_t mask, r;

	if (upper_bound < 2)
		return 0;

	mask = (uint32_t)-1 >> arc4random_clz((upper_bound - 1) | 1);
	for (;;) {
		r = arc4random();
		if ((r & mask) < upper_bound)
			return r & mask;
	}
	/* NOTREACHED */
}

/*
 * Like Apple's
 * https://opensource.apple.com/source/Libc/Libc-1158.50.2/gen/FreeBSD/arc4random.c
 */
static uint32_t
arc4random_uniform3(uint32_t upper_bound)
{
	int zeroes, bits, remain = 32;
	uint32_t mask, r;

	if (upper_bound < 2)
		return 0;

	zeroes = arc4random_clz((upper_bound - 1) | 1);
	bits = 32 - zeroes;
	mask = (uint32_t)-1 >> zeroes;

	for (;;) {
		r = arc4random();
		if ((r & mask) < upper_bound)
			return r & mask;
		for (remain = zeroes; remain >= bits; remain -= bits) {
			r >>= bits;
			if ((r & mask) < upper_bound)
				return r & mask;
		}
	}
	/* NOTREACHED */
}


#include <sys/time.h>

static uint32_t
fixture(const char *s, uint32_t (* const rnd)(uint32_t))
{
	const uint32_t trials = 50 * 1000 * 1000;
	const uint32_t max = 0x8000;
	const uint32_t mul = 7;
	unsigned int v, i, r;
	struct timeval start, finish, delta;

	gettimeofday(&start, NULL);
	for (v = 3, r = 0; v < max; v *= mul) {
		/* printf("%u\n", v); */
		for (i = 0; i < trials; i++)
			r |= rnd(v);
	}
	gettimeofday(&finish, NULL);
	timersub(&finish, &start, &delta);
	printf("%s; elapsed = %lld.%06lld\n", s,
	    (long long)delta.tv_sec, (long long)delta.tv_usec);
	return r;
}

int main(void) {
	fixture("openbsd", arc4random_uniform1);
	fixture("bitmask", arc4random_uniform2);
	fixture("bitmask+reuse", arc4random_uniform3);
}



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Philip Guenther
On Tue, May 17, 2022 at 1:10 PM Steffen Nurpmeso  wrote:

> Joerg Sonnenberger wrote in
>  :
>  |Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
>  |> I made a couple new versions of a new kind of arc4random_uniform-like
>  ...
>  |If your main use case is limiting the amount of cryptography when using
>  |small bounds, there is a much simpler approach to be taken here. For
>  |boundaries below 256, use arc4random_buf to extract one byte if bound is
>  |a power of two, otherwise two. This gives most of the performance
>  |benefit without complicating the algorithm. Extracting two bytes ensures
 |that the probability of success is > 99% and the double extracting
>  |doesn't eat up the benefits.
>
> You can use (really implemented) _buf() if you need a 8-bit or
> 16-bit etc number.
>
> I find that _uniform() often makes no difference to a simple
> modulo because like the comment in _uniform() says "p > 0.5 (worst
> case, usually far better", and usually RNGs sprinkle bits nicely,

What does that statement mean?  You seem to be saying "modulo is uniform,
except when it isn't, which could be almost half the time for some cases,
but when it's uniform it's uniform, so why bother making it actually
correct and dependable".

I mean, what does that _mean_???   It's as if I said "my text handling
program handles all characters uniformly, except those with accents, but
that's less than 10% of the characters I type, so it handles all characters
uniformly."  WTF, NO!



> 0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FFFFFF in
> most cases for "my RNG" (of 100000 803/759/793 NOT; 776/805/793
> NOT for Linux getrandom(2)), which is a pretty high cut off.
> Using _uniform() just because of its name seems strange thus.
>

Where do these ideas come from, that "0 bytes 'do not occur'"??  If your
rand generator doesn't provide zero bytes at the expected frequency, you
know, 1 in 256, then you're using a garbage random number generator.
Please stop making such suggestions here because THEY ARE NOT TRUE ABOUT
OPENBSD.  Do ya'll not bother to test the claims that you make?

: bleys; cat f.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	uint32_t u;
	long count;

	count = 0;
	while ((u = arc4random()) > 0x1ffffff)
		count++;
	printf("%08x\t%ld\n", u, count);

	count = 0;
	for (;;) {
		u = arc4random();
		if ((u & 0xff000000) == 0 ||
		    (u & 0x00ff0000) == 0 ||
		    (u & 0x0000ff00) == 0 ||
		    (u & 0x000000ff) == 0)
			break;
		count++;
	}
	printf("%08x\t%ld\n", u, count);

	return 0;
}
: bleys; cc f.c
: bleys; ./a.out
00b82e5c	58
ab478800	36
: bleys;


Philip Guenther


Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Steffen Nurpmeso
Steffen Nurpmeso wrote in
 <20220517220924.xohqc%stef...@sdaoden.eu>:
 |Joerg Sonnenberger wrote in
 | :
 ||Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
 ||> I made a couple new versions of a new kind of arc4random_uniform-like
 ...
 |0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FFFFFF in
 |most cases for "my RNG" (of 100000 803/759/793 NOT; 776/805/793
 |NOT for Linux getrandom(2)), which is a pretty high cut off.
 ...

  /*@ lotto.cxx - output six random numbers */

  #include 
  #include 

  #include 

  //#define NYDPROF_ENABLE
  //#define NYD_ENABLE
  //#define NYD2_ENABLE
  #include 
  NSPC_USE(su)

  static u32
  bounded_rand(u32 range, u32 rv){
 for(u32 t = -range % range;;){
if(rv >= t)
   return rv % range;
log::write(log::emerg, "NOFIT");
 }
  }

  int
  main(void){
 u32 uni[50], mod[50], rv;

 state::create_core(NIL, (state::debug | log::debug), state::err_nopass);

 su_mem_set(uni, 0, sizeof uni);
 su_mem_set(mod, 0, sizeof mod);

  u32 au=0;
 for(u32 i = 0; i < 100000; ++i){
random::builtin_generate(&rv, sizeof rv, state::err_nopass);
//if(getrandom(&rv, sizeof rv, GRND_NONBLOCK) == -1)
//   log::write(log::emerg, "getrandom(2)");
  if(rv < 0x01FFFFFF)
  log::write(log::info, "AU %u - 0x%X", ++au,rv);
++mod[rv % NELEM(mod)];
++uni[bounded_rand(NELEM(mod), rv)];
 }

 u32 unilo, modlo, unihi, modhi;
 unilo = modlo = max::u32;
 unihi = modhi = 0;

 for(u32 i = 0; i < NELEM(uni); ++i){
unilo = get_min(unilo, uni[i]);
modlo = get_min(modlo, mod[i]);
unihi = get_max(unihi, uni[i]);
modhi = get_max(modhi, mod[i]);
log::write(log::info, "%u - %u / %u %s",
   i, uni[i], mod[i], (uni[i] != mod[i] ? "!!!" : "="));
 }
 log::write(log::info, "MIN %u / %u, MAX %u / %u - au %u",
   unilo, modlo, unihi, modhi,au);

 return 0;
  }

  #include 
  // s-it-mode

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Steffen Nurpmeso
Joerg Sonnenberger wrote in
 :
 |Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
 |> I made a couple new versions of a new kind of arc4random_uniform-like
 ...
 |If your main use case is limiting the amount of cryptography when using
 |small bounds, there is a much simpler approach to be taken here. For
 |boundaries below 256, use arc4random_buf to extract one byte if bound is
 |a power of two, otherwise two. This gives most of the performance
 |benefit without complicating the algorithm. Extracting two bytes ensures
 |that the probability of success is > 99% and the double extracting
 |doesn't eat up the benefits.

You can use (really implemented) _buf() if you need a 8-bit or
16-bit etc number.

I find that _uniform() often makes no difference to a simple
modulo because like the comment in _uniform() says "p > 0.5 (worst
case, usually far better", and usually RNGs sprinkle bits nicely,
0 bytes "do not occur", so a 32-bit RNG value "is" >=0x01FFFFFF in
most cases for "my RNG" (of 100000 803/759/793 NOT; 776/805/793
NOT for Linux getrandom(2)), which is a pretty high cut off.
Using _uniform() just because of its name seems strange thus.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-17 Thread Joerg Sonnenberger
Am Fri, May 13, 2022 at 09:43:26AM -0500 schrieb Luke Small:
> I made a couple new versions of a new kind of arc4random_uniform-like
> function and some example functions which use them. Instead of having a
> sufficiently large random number greater than the modulus, I pick a random
> number using arc4random() from a bitfield where the length of the bitfield
> is just below or slightly beyond the value of the modulus and returns the
> bitfield if it is less than the value of the modulus.

If your main use case is limiting the amount of cryptography when using
small bounds, there is a much simpler approach to be taken here. For
boundaries below 256, use arc4random_buf to extract one byte if bound is
a power of two, otherwise two. This gives most of the performance
benefit without complicating the algorithm. Extracting two bytes ensures
that the probability of success is > 99% and the double extracting
doesn't eat up the benefits.

Joerg
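
A sketch of what Joerg describes, with hypothetical names (`small_uniform`
and `fill_random` are illustrations, not proposed API; `fill_random` stands
in for arc4random_buf(3) and uses rand() only so the sketch is
self-contained, so it is NOT cryptographic):

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for arc4random_buf(3): fills n bytes; rand() is only a
 * placeholder so this compiles anywhere -- not cryptographic. */
static void
fill_random(void *p, size_t n)
{
	unsigned char *b = p;

	while (n--)
		*b++ = (unsigned char)(rand() & 0xff);
}

static uint8_t
small_uniform(uint8_t upper_bound)
{
	uint8_t b;
	uint16_t r;

	if (upper_bound < 2)
		return 0;
	if ((upper_bound & (upper_bound - 1)) == 0) {
		/* power of two: one byte, exact mask, no bias */
		fill_random(&b, sizeof b);
		return b & (upper_bound - 1);
	}
	for (;;) {
		/* two bytes: reject the top 65536 % upper_bound values;
		 * rejection probability < 255/65536, so success > 99% */
		fill_random(&r, sizeof r);
		if (r < 65536 - 65536 % upper_bound)
			return (uint8_t)(r % upper_bound);
	}
}
```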



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Damien Miller
On Mon, 16 May 2022, Luke Small wrote:

> Yeah, I see your point.
>
> I suppose it depends on how conservative you want to be and whether
> you want to supply options to people like getchar_unlocked when it
> isn’t essential.
>
> It could be made manually fork-safe if I could make a simple feature
> where “arc4random_uniform_unlocked(0);” with a 0 upperbound could
> trigger a reset of the static variables rand_bits and rand_holder
> which would be simple enough and could be added to a man page. I
> certainly read the man pages. (I’m surprised “man getchar” doesn’t
> “see also” to getchar_unlocked() by the way.)
>
> If you look at profiling with programs that call it a lot,
> arc4random() inside arc4random_uniform() calls the expensive rekey
> function which makes it take more time. That’s why I can get around
> 2X-3X performance on a typical repetitive small upperbound loop and
> an extra 20% improvement on a single 65536 Knuth shuffle, loop even
> though my function repeats a binary search for ‘bit’ size every single
> time and has misses which demands calling up more data.
>
> Otherwise my function would be particularly useful when there’s a loop
> with small upperbound like: 26+26+10 which, if I recall correctly, is
> in identd, which would call it frequently.

Just to bring this back to where we came in: even putting thread-safety
aside (which you absolutely can't): it doesn't matter how much faster
it is, your replacement function isn't useful until you do the work to
demonstrate correctness.

You have done literally zero so far, not even the bare minimum of
testing the output. As a result your first version was demonstrated to
be completely broken by literally the most basic of possible tests, a
mere 10 lines of code.

That you left this to others to do tells me that you fundamentally don't
understand the environment in which you're trying to operate, and that
you don't respect the time of the people you're trying to convince.

Please stop wasting our time.

-d


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Luke Small
Yeah, I see your point.

I suppose it depends on how conservative you want to be and whether you
want to supply options to people like getchar_unlocked when it isn’t
essential.

It could be made manually fork-safe if I could make a simple feature where
“arc4random_uniform_unlocked(0);” with a 0 upperbound could trigger a reset
of the static variables rand_bits and rand_holder which would be simple
enough and could be added to a man page. I certainly read the man pages.
(I’m surprised “man getchar” doesn’t “see also” to getchar_unlocked() by
the way.)

If you look at profiling with programs that call it a lot, arc4random()
inside arc4random_uniform() calls the expensive rekey function which makes
it take more time. That’s why I can get around 2X-3X performance on a
typical repetitive small upperbound loop and an extra 20% improvement on a
single 65536 Knuth shuffle, loop even though my function repeats a binary
search for ‘bit’ size every single time and has misses which demands
calling up more data.

Otherwise my function would be particularly useful when there’s a loop with
small upperbound like: 26+26+10 which, if I recall correctly, is in identd,
which would call it frequently.

I have a project that I generate MANY random record names much like that
and I use fork()/fork()-exec() on many processes well before calling
randomizing functions. Maybe I’m not the only one.

I don’t want to even try multithreading, especially as often as you guys
say that it is too unpredictable. I believe you. Using lua scripts in
multi-worker-process redis calls to avoid race conditions is awkward enough
for me.


But otherwise it’s certainly your prerogative to not have it. At least you
can’t legitimately say that it adds too much extra code or is too complex.
It’s pretty simple to understand if you’re familiar with bitwise stuff.

On Mon, May 16, 2022 at 5:35 PM Stuart Henderson 
wrote:

> On 2022/05/16 15:13, Luke Small wrote:
> > If you’re not running a threaded program, my function wouldn’t be “less
> > safe.”
> >
> > I’d imagine that 99% of programs aren’t multithreaded.
>
> code is reused in different places. non threaded programs are sometimes
> turned into threaded programs and when that happens, sometimes
> non-thread-safe calls are missed. so i'd argue that it is still less
> safe.
>
> in some cases there might be benefits that would mean it's worth it,
> especially if the failure modes would be obvious such that they can
> be detected. really not seeing that here. (how often are you even
> calling arc4random_uniform to consider it slow?!)
>
> if the consequence is not a crash but instead subtly broken randomness,
> how long do you think it's going to take to notice and report/fix it?
> even *very* broken randomness in widespread software distributions
> has been known to sit around for a long time before it's noticed:
>
> - predictable rng in a popular os. *serious* bug. introduced 2006,
> discovered/reported nearly 2 years later.
>
> - non-fork-safe rng in a popular ssl library, introduced sometime before
> sept 2018, reported may 2019.
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Stuart Henderson
On 2022/05/16 15:13, Luke Small wrote:
> If you’re not running a threaded program, my function wouldn’t be “less
> safe.”
> 
> I’d imagine that 99% of programs aren’t multithreaded.

code is reused in different places. non threaded programs are sometimes
turned into threaded programs and when that happens, sometimes
non-thread-safe calls are missed. so i'd argue that it is still less
safe.

in some cases there might be benefits that would mean it's worth it,
especially if the failure modes would be obvious such that they can
be detected. really not seeing that here. (how often are you even
calling arc4random_uniform to consider it slow?!)

if the consequence is not a crash but instead subtly broken randomness,
how long do you think it's going to take to notice and report/fix it?
even *very* broken randomness in widespread software distributions
has been known to sit around for a long time before it's noticed:

- predictable rng in a popular os. *serious* bug. introduced 2006,
discovered/reported nearly 2 years later.

- non-fork-safe rng in a popular ssl library, introduced sometime before
sept 2018, reported may 2019.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Luke Small
No...am I incorrect, especially on OpenBSD?

Of course since you made such a remark, you seem like the kind of fellow
that would put the nail in the coffin for spite.

...now I sound like an asshole.

On Mon, May 16, 2022 at 4:00 PM Theo de Raadt  wrote:

> hey luke you know you sound like an asshole right?
>
>
> Luke Small  wrote:
>
> > If you’re not running a threaded program, my function wouldn’t be “less
> > safe.”
> >
> > I’d imagine that 99% of programs aren’t multithreaded.
> >
> > On Mon, May 16, 2022 at 1:01 PM  wrote:
> >
> > > > There is the specifically non-threadsafe call getchar_unlocked() on
> > > OpenBSD
> > > > which is presumably available for performance reasons alone, when
> > > getchar()
> > > > is a perfectly viable option and is even an ISO conforming function.
> > > What I
> > > > submitted could be such a higher performance non-threadsafe function.
> > > >
> > > > so, how about arc4random_uniform_unlocked() ?!
> > >
> > > getchar_unlocked is mandated by POSIX.
> > >
> > > OpenBSD has not yet invented an alternate function that only exists to
> > > give away safety for performance. It has only gone in the opposite
> > > direction if anything.
> > >
> > > --
> > -Luke
>


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Theo de Raadt
hey luke you know you sound like an asshole right?


Luke Small  wrote:

> If you’re not running a threaded program, my function wouldn’t be “less
> safe.”
> 
> I’d imagine that 99% of programs aren’t multithreaded.
> 
> On Mon, May 16, 2022 at 1:01 PM  wrote:
> 
> > > There is the specifically non-threadsafe call getchar_unlocked() on
> > OpenBSD
> > > which is presumably available for performance reasons alone, when
> > getchar()
> > > is a perfectly viable option and is even an ISO conforming function.
> > What I
> > > submitted could be such a higher performance non-threadsafe function.
> > >
> > > so, how about arc4random_uniform_unlocked() ?!
> >
> > getchar_unlocked is mandated by POSIX.
> >
> > OpenBSD has not yet invented an alternate function that only exists to
> > give away safety for performance. It has only gone in the opposite
> > direction if anything.
> >
> > --
> -Luke



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Luke Small
If you’re not running a threaded program, my function wouldn’t be “less
safe.”

I’d imagine that 99% of programs aren’t multithreaded.

On Mon, May 16, 2022 at 1:01 PM  wrote:

> > There is the specifically non-threadsafe call getchar_unlocked() on
> OpenBSD
> > which is presumably available for performance reasons alone, when
> getchar()
> > is a perfectly viable option and is even an ISO conforming function.
> What I
> > submitted could be such a higher performance non-threadsafe function.
> >
> > so, how about arc4random_uniform_unlocked() ?!
>
> getchar_unlocked is mandated by POSIX.
>
> OpenBSD has not yet invented an alternate function that only exists to
> give away safety for performance. It has only gone in the opposite
> direction if anything.
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Steffen Nurpmeso
Philip Guenther wrote in
 :
 |On Sun, 15 May 2022, Steffen Nurpmeso wrote:
 |> Stuart Henderson wrote in
 |...
 |>|what's the perceived problem you're wanting to solve? and does that
 |>|problem actually exist in the first place?
 |> 
 |> The problem is that if have a low upper bound then modulo will "remove a 
 |> lot of randomization".  For example if you have a program which 
 |> generates Lotto numbers (1..49), then using _uniform() as it is will 
 |> generate many duplicates.
 |
 |Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform 
 |distribution.  Says right so in the manpage.  If an implementation of that 
 |API fails to do that, it's a broken implementation.

I always admired its source code comments and have been left
stunning more than just once.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-16 Thread Sven M . Hallberg
Luke Small on Sun, May 15 2022:
> The current implementation is nothing more than a naive arc4random() %
> upper_bound which trashes initial arc4random() calls it doesn’t like,

Yes, that's exactly what it is.

> The whole transformation by modulus of perfectly decent random data
> seems so awkward. [...] it seems like an ugly HACK

Let me point out that this way of doing it is absolutely standard. It is
not awkward or a hack.

> to simply meet demand

This is probably a good way of describing the design goal for this
highly sensitive piece of kernel internals.

-p

PS: I am not an OpenBSD developer, a tiny contributor at most. I speak
as a bystander who happens to have some domain knowledge.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
Yeah. It most likely won't go in. From past experience and advice, not
necessarily just from a perceived lack of merit.

However, many, if not all of the arguments are based upon non-facts and
misconceptions from earlier submissions or just not understanding what the
software is doing.

The only real thing that makes it not quite as good as the standard is mine
isn't threadsafe. If it could be accepted as a higher performance,
non-threadsafe call, it would perform better in many typical cases and
perhaps even give a safer return value, especially in large upper_bound
edge cases, I suspect.

There is the specifically non-threadsafe call getchar_unlocked() on OpenBSD
which is presumably available for performance reasons alone, when getchar()
is a perfectly viable option and is even an ISO conforming function. What I
submitted could be such a higher performance non-threadsafe function.

so, how about arc4random_uniform_unlocked() ?!

...other than making upper_bound a uint32_t instead of the submitted
uint64_t. That'd be somewhat of a problem.

-Luke


> You've had several developers tell you this is not going to go in. I'd
> suggest
> "read the room".
>
> If you want this for your own use, just keep it as a local diff. Nobody
> will
> know (or likely care).
>
> -ml
>


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
I’m not trying to be rude, but you don’t realize what’s going on here:

uuu is a bitmask:

‘uuu’ (or (1 << bits)-1 ) in “ret = rand_holder & uuu;“ , only puts the
lower ‘bit’ quantity of bits of rand_holder into ret, then it right shifts
rand_holder afterward to trash them every time in the loop when it’s done.

So if bits is 8, uuu is going to be 0xff

No, you aren't:
>
> > for (;;) {
> > if (rand_bits < bits) {
> > rand_holder |= ((uint64_t)arc4random()) <<
> > rand_bits;
> >
> > /*
> >  * rand_bits will be a number between 0 and 31
> here
> >  * so the 0x20 bit will be empty
> >  * rand_bits += 32;
> >  */
> > rand_bits |= 32;
> > }
> >
> > ret = rand_holder & uuu;
> > rand_holder >>= bits;
> > rand_bits -= bits;
> >
> > if (ret < upper_bound)
> > return ret;
> > }
>
> This isn't rejection sampling. This is reusing part of the rejected
> sample.

-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Mike Larkin
On Sun, May 15, 2022 at 08:40:19PM -0500, Luke Small wrote:
> https://marc.info/?l=openbsd-tech&m=165259528425835&w=2
>
> This one (which is strongly based upon my first of two versions) which I
> submitted after Guenther correctly trashed version 2, doesn’t reuse any
> part of the sample. It picks up a clean new bitfield upon failure.
>
> I think Guenther didn’t, perhaps like yourself, realize I submitted this
> later program. That’s why he said it wasn’t correct. It didn’t occur to me
> at the time of responding to him: “correct correct correct.”
>

You've had several developers tell you this is not going to go in. I'd suggest
"read the room".

If you want this for your own use, just keep it as a local diff. Nobody will
know (or likely care).

-ml

> On Sun, May 15, 2022 at 7:47 PM Damien Miller  wrote:
>
> > On Sat, 14 May 2022, Luke Small wrote:
> >
> > > Look at my code. I don’t even use a modulus operator. I perform hit and
> > > miss with a random bitstream.
> > >
> > > How can I have a bias of something I don’t do? I return a bitstream which
> > > meets the parameters of being a value less than the upper bound. Much
> > like
> > > arc4random_buf().
> > >
> > > If I use arc4random_uniform() repeatedly to create a random distribution
> > of
> > > say numbers less than 0x1000 or even something weird like 0x1300 will the
> > > random distribution be better with arc4random_uniform() or with mine? For
> > > 0x1000 mine will simply pluck 12 bits of random data straight from the
> > > arc4random() (and preserve the remaining 20 bits for later) on the first
> > > try, just like it’s arc4random_buf().
> > >
> > > arc4random_uniform() will perform a modulus of a 32 bit number which adds
> > > data to the bitstream. Does it make it better? Perhaps it makes it harder
> > > to guess the source bits.
> > >
> > > I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
> > > looking at modulo bias.
> > >
> > > I didn’t know what it was, before, but I basically “rejection sample”:
> > >
> > >
> > https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/
> >
> > No, you aren't:
> >
> > > for (;;) {
> > > if (rand_bits < bits) {
> > > rand_holder |= ((uint64_t)arc4random()) <<
> > > rand_bits;
> > >
> > > /*
> > >  * rand_bits will be a number between 0 and 31
> > here
> > >  * so the 0x20 bit will be empty
> > >  * rand_bits += 32;
> > >  */
> > > rand_bits |= 32;
> > > }
> > >
> > > ret = rand_holder & uuu;
> > > rand_holder >>= bits;
> > > rand_bits -= bits;
> > >
> > > if (ret < upper_bound)
> > > return ret;
> > > }
> >
> > This isn't rejection sampling. This is reusing part of the rejected
> > sample.
> >
> > Think of it like this: you want to uniformly generate a number in the
> > range [2:10] by rolling 2x 6-sided dice. What do you do when you roll
> > 11 or 12? You can't just reroll one of the dice because the other dice
> > is constrained to have rolled either 5 or 6, and so proceeding with
> > it would force the output to be in the range [6:11] for these ~5.6%
> > of initial rolls. Your output is no longer uniform.
> >
> > BTW the existing code already implements the preferred approach of the
> > article you quoted.
> >
> > -d
>
> --
> -Luke



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
https://marc.info/?l=openbsd-tech&m=165259528425835&w=2

This one (which is strongly based upon my first of two versions) which I
submitted after Guenther correctly trashed version 2, doesn’t reuse any
part of the sample. It picks up a clean new bitfield upon failure.

I think Guenther didn’t, perhaps like yourself, realize I submitted this
later program. That’s why he said it wasn’t correct. It didn’t occur to me
at the time of responding to him: “correct correct correct.”

On Sun, May 15, 2022 at 7:47 PM Damien Miller  wrote:

> On Sat, 14 May 2022, Luke Small wrote:
>
> > Look at my code. I don’t even use a modulus operator. I perform hit and
> > miss with a random bitstream.
> >
> > How can I have a bias of something I don’t do? I return a bitstream which
> > meets the parameters of being a value less than the upper bound. Much
> like
> > arc4random_buf().
> >
> > If I use arc4random_uniform() repeatedly to create a random distribution
> of
> > say numbers less than 0x1000 or even something weird like 0x1300 will the
> > random distribution be better with arc4random_uniform() or with mine? For
> > 0x1000 mine will simply pluck 12 bits of random data straight from the
> > arc4random() (and preserve the remaining 20 bits for later) on the first
> > try, just like it’s arc4random_buf().
> >
> > arc4random_uniform() will perform a modulus of a 32 bit number which adds
> > data to the bitstream. Does it make it better? Perhaps it makes it harder
> > to guess the source bits.
> >
> > I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
> > looking at modulo bias.
> >
> > I didn’t know what it was, before, but I basically “rejection sample”:
> >
> >
> https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/
>
> No, you aren't:
>
> > for (;;) {
> > if (rand_bits < bits) {
> > rand_holder |= ((uint64_t)arc4random()) <<
> > rand_bits;
> >
> > /*
> >  * rand_bits will be a number between 0 and 31
> here
> >  * so the 0x20 bit will be empty
> >  * rand_bits += 32;
> >  */
> > rand_bits |= 32;
> > }
> >
> > ret = rand_holder & uuu;
> > rand_holder >>= bits;
> > rand_bits -= bits;
> >
> > if (ret < upper_bound)
> > return ret;
> > }
>
> This isn't rejection sampling. This is reusing part of the rejected
> sample.
>
> Think of it like this: you want to uniformly generate a number in the
> range [2:10] by rolling 2x 6-sided dice. What do you do when you roll
> 11 or 12? You can't just reroll one of the dice because the other die
> is constrained to have rolled either 5 or 6, and so proceeding with
> it would force the output to be in the range [6:11] for these ~5.6%
> of initial rolls. Your output is no longer uniform.
>
> BTW the existing code already implements the preferred approach of the
> article you quoted.
>
> -d

-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Damien Miller
On Sat, 14 May 2022, Luke Small wrote:

> Look at my code. I don’t even use a modulus operator. I perform hit and
> miss with a random bitstream.
> 
> How can I have a bias of something I don’t do? I return a bitstream which
> meets the parameters of being a value less than the upper bound. Much like
> arc4random_buf().
> 
> If I use arc4random_uniform() repeatedly to create a random distribution of
> say numbers less than 0x1000 or even something weird like 0x1300 will the
> random distribution be better with arc4random_uniform() or with mine? For
> 0x1000 mine will simply pluck 12 bits of random data straight from the
> arc4random() (and preserve the remaining 20 bits for later) on the first
> try, just like it’s arc4random_buf().
> 
> arc4random_uniform() will perform a modulus of a 32 bit number which adds
> data to the bitstream. Does it make it better? Perhaps it makes it harder
> to guess the source bits.
> 
> I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
> looking at modulo bias.
> 
> I didn’t know what it was, before, but I basically “rejection sample”:
> 
> https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/

No, you aren't:

> for (;;) {
> if (rand_bits < bits) {
> rand_holder |= ((uint64_t)arc4random()) <<
> rand_bits;
> 
> /* 
>  * rand_bits will be a number between 0 and 31 here
>  * so the 0x20 bit will be empty
>  * rand_bits += 32;
>  */ 
> rand_bits |= 32;
> }
> 
> ret = rand_holder & uuu;
> rand_holder >>= bits;
> rand_bits -= bits;
> 
> if (ret < upper_bound)
> return ret;
> }

This isn't rejection sampling. This is reusing part of the rejected
sample.

Think of it like this: you want to uniformly generate a number in the
range [2:10] by rolling 2x 6-sided dice. What do you do when you roll
11 or 12? You can't just reroll one of the dice because the other die
is constrained to have rolled either 5 or 6, and so proceeding with
it would force the output to be in the range [6:11] for these ~5.6%
of initial rolls. Your output is no longer uniform.

BTW the existing code already implements the preferred approach of the
article you quoted.

-d


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Damien Miller
On Sun, 15 May 2022, Luke Small wrote:

> Do I really have to use specific terminology to make a point?
>
> I'm not educated enough on chacha20 enough to know whether, like I
> pointed out, whether choosing 5 bits from the middle of (or even from
> the tail end of one and the beginning of another) 32 bit pseudorandom
> cipher is "correct."

You don't need to understand chacha20 to understand
arc4random_uniform(). I certainly didn't when I wrote it.

The underlying CSPRNG is irrelevant to how arc4random_uniform() works.
It is treated as an oracle that provides 32 random bits upon request. You
could swap it out for 32 coin-tossing monkeys and the implementation
wouldn't need to change.

It requests another 32-bit random value for each attempt at satisfying
the bounds check because they need to be independent - reusing parts of
a previous attempt is highly likely to introduce biases.

It's almost certainly possible to make this function faster, but it's
also very easy to get it wrong (e.g. I made one stupid math error in its
early implementation, forever immortalised by CVS). The existing code
has the advantage of being very obvious in how it works and therefore
has a very low risk of being wrong.

If someone is proposing to move to something less obvious then it's
incumbent upon them to do the work to prove that their alternative is
just as correct.

> ...correct correct correct. Did I use that buzzword enough?

Highly experienced people are taking the time to give you detailed,
critical feedback. This can be hard to receive, but if you ever want to
improve then you should consider it and try to engage constructively.

-d



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
Do I really have to use specific terminology to make a point?

I'm not educated enough on chacha20 enough to know whether, like I pointed
out, whether choosing 5 bits from the middle of (or even from the tail end
of one and the beginning of another) 32 bit pseudorandom cipher is
"correct."

...correct correct correct. Did I use that buzzword enough?

-Luke


On Sun, May 15, 2022 at 5:26 PM Philip Guenther  wrote:

> On Sun, 15 May 2022, Luke Small wrote:
> > The current implementation is nothing more than a naive arc4random() %
> > upper_bound which trashes initial arc4random() calls it doesn’t like,
> then
> > transforms over a desired modulus. The whole transformation by modulus of
> > perfectly decent random data seems so awkward. It’s not like it is used
> as
> > some majestic artistry of RSA; it seems like an ugly HACK to simply meet a
> > demand, lacking something better.
>
> You fail to mention correctness at all or address the fact that your
> version isn't while the current one is.  Meanwhile, you talk about getting
> only just enough random data as if there's some sort of limited supply
> when there isn't.
>
> "My version may be wrong, but at least it doesn't look naive!"
>
> That is utterly the wrong attitude for OpenBSD code.
>
>
> Best wishes.
>
> Philip Guenther
>


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Philip Guenther
On Sun, 15 May 2022, Luke Small wrote:
> The current implementation is nothing more than a naive arc4random() %
> upper_bound which trashes initial arc4random() calls it doesn’t like, then
> transforms over a desired modulus. The whole transformation by modulus of
> perfectly decent random data seems so awkward. It’s not like it is used as
> some majestic artistry of RSA; it seems like an ugly HACK to simply meet a
> demand, lacking something better.

You fail to mention correctness at all or address the fact that your 
version isn't while the current one is.  Meanwhile, you talk about getting 
only just enough random data as if there's some sort of limited supply 
when there isn't.

"My version may be wrong, but at least it doesn't look naive!"

That is utterly the wrong attitude for OpenBSD code.


Best wishes.

Philip Guenther



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
The current implementation is nothing more than a naive arc4random() %
upper_bound which trashes initial arc4random() calls it doesn’t like, then
transforms over a desired modulus. The whole transformation by modulus of
perfectly decent random data seems so awkward. It’s not like it is used as
some majestic artistry of RSA; it seems like an ugly HACK to simply meet a
demand, lacking something better.

If you understand what I’ve done, it streams in a bitfield into an integer
type like it’s a buffer for just enough or slightly more data to meet the
demands of the upperbound and if it exceeds upperbound-1, it is trashed and
reads in a completely new bitfield to check. It relies on arc4random()
supplying good random data regardless of how many bits are in the bitfield.
If it does so, it should and seems to supply a good distribution across the
length of the bitfield which may often be something like 5 for a common
26+26+10 upper_bound in /usr/src. It seems to me that it should be a pretty
good, if not superior, method; at least in the realm of cleaner results.
Perhaps it’s confusing what I’ve done with all the bitwise operators, but
it isn’t some random hacky thing I’ve cobbled together.

Or does arc4random() only provide decent random data 32 bits at a time; or
an even 8 bits at a time as arc4random_buf() would suggest?

All I would have to prove is that chacha20 provides good or superior random
bitfields regardless of how many bits are needed and regardless of whether
they begin at the beginning of a byte.

I don’t have the education for that, but “I got a ‘puter for Christmas.”
lol. I can perhaps run simulations if I have nothing better to do.


> I think I can say we know here uniformity is only *one* of the
> desirable properties of a secure random generator.
>
> But I don't think anybody else except Luke was talking about
> "improving".  The sole purpose of arc4random_uniform() is to give a
> good implementation of a random number function in a specific range
> using arc4random() as the source. This is needed because the naive
> implementation arc4random() % upper_bound is not uniform.
>
> -Otto
>
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Otto Moerbeek
On Sun, May 15, 2022 at 08:57:09AM -0300, Crystal Kolipe wrote:

> On Sun, May 15, 2022 at 11:44:29AM +0200, Otto Moerbeek wrote:
> > On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote:
> > > How did someone prove the current implementation was cryptographically
> > > sound? Did they run massive simulations which ran the gamut of the 
> > > uint32_t
> > > range which demanded tight tolerances over varying length runs?
> > > 
> > > How was rc4 cipher proven bad for pseudorandom numbers? I'd be willing 
> > > to
> > > bet it wasn't from a mathematical proof; it was from bad data.
> 
> > You miss the point completely. The point is: how do you derive a
> > uniformly distributed random function for a smaller range, given a
> > uniformly distributed random function over the range [0-2^32-1].
> > 
> > The current implementation does exactly that and has all the
> > information in the comments to verify the uniformity claim. You only
> > need to use basic mathematical properties of modulo arithmetic to
> > do the verification.
> 
> You do all realise that uniform distribution alone does not make a
> good random number generator, don't you?
> 
> For example:
> 
> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5...
> 
> That's uniformly distributed, and also useless as a random number
> stream.
> 
> Furthermore, _cryptographically secure_ random number generation is
> not the same as _mathematically good_ random number generation.
> 
> There are plenty of random number generation formulas which are
> considered good and useful from a mathematical basis, but which are
> useless for cryptography.
> 
> So random, (pun intended), hacks at the arc4random code are not
> likely to 'improve' it from the general standpoint, (although if you
> have a specific need for a specific private application, that's
> different).  I think Stuart has already more or less made that point.
> 

I think I can say we know here uniformity is only *one* of the
desirable properties of a secure random generator.

But I don't think anybody else except Luke was talking about
"improving".  The sole purpose of arc4random_uniform() is to give a
good implementation of a random number function in a specific range
using arc4random() as the source. This is needed because the naive
implementation arc4random() % upper_bound is not uniform.

-Otto




Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Crystal Kolipe
On Sun, May 15, 2022 at 11:44:29AM +0200, Otto Moerbeek wrote:
> On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote:
> > How did someone prove the current implementation was cryptographically
> > sound? Did they run massive simulations which ran the gamut of the uint32_t
> > range which demanded tight tolerances over varying length runs?
> > 
> > How was rc4 cipher proven bad for pseudorandom numbers? I'd be willing to
> > bet it wasn't from a mathematical proof; it was from bad data.

> You miss the point completely. The point is: how do you derive a
> uniformly distributed random function for a smaller range, given a
> uniformly distributed random function over the range [0-2^32-1].
> 
> The current implementation does exactly that and has all the
> information in the comments to verify the uniformity claim. You only
> need to use basic mathematical properties of modulo arithmetic to
> do the verification.

You do all realise that uniform distribution alone does not make a
good random number generator, don't you?

For example:

1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5...

That's uniformly distributed, and also useless as a random number
stream.

Furthermore, _cryptographically secure_ random number generation is
not the same as _mathematically good_ random number generation.

There are plenty of random number generation formulas which are
considered good and useful from a mathematical basis, but which are
useless for cryptography.

So random, (pun intended), hacks at the arc4random code are not
likely to 'improve' it from the general standpoint, (although if you
have a specific need for a specific private application, that's
different).  I think Stuart has already more or less made that point.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Marc Espie
Random generators have been analyzed for years. 

Pick up "Concrete Mathematics" by Don E. Knuth, read it, then come back
with actual science.



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Otto Moerbeek
On Sun, May 15, 2022 at 04:27:30AM -0500, Luke Small wrote:

> I have a feeling that making it threadsafe under any quantity of threads
> and still perform is likely not feasible, but if I could; or if making a
> nonthreadsafe variant was possible:
> 
> Creating a mathematical proof that somehow is “simple and probably correct”
> enough, would be an impossible task even for a PhD mathematician.
> 
> How did someone prove the current implementation was cryptographically
> sound? Did they run massive simulations which ran the gamut of the uint32_t
> range which demanded tight tolerances over varying length runs?
> 
> How was rc4 cipher proven bad for pseudorandom numbers? I’d be willing to
> bet it wasn’t from a mathematical proof; it was from bad data.
> 
> I’m guessing that large upperbounds approaching 2**32 don’t feed very
> soundly in the current implementation using a modulus; although I suspect
> that there isn’t much of a call for such things, I’m pretty sure I saw a
> 3,000,000,000 upperbound in the src partition tonight.


You miss the point completely. The point is: how do you derive a
uniformly distributed random function for a smaller range, given a
uniformly distributed random function over the range [0-2^32-1].

The current implementation does exactly that and has all the
information in the comments to verify the uniformity claim. You only
need to use basic mathematical properties of modulo arithmetic to
do the verification.

-Otto

> 
> On Sun, May 15, 2022 at 3:15 AM Otto Moerbeek  wrote:
> 
> > On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote:
> >
> > > This is version 1, which I was pretty sure was secure.
> > >
> > > I revamped it with a few features and implanted the binary search for
> > 'bit'
> > >
> > > in most cases, which aren't intentionally worst-case, it's pretty darned
> > > fast!
> > >
> > > This is a sample run of my program with your piece of code included:
> > >
> > >
> > >  1  99319 100239
> > >  2 100353  99584
> > >  3 100032  99879
> > >  4  99704 100229
> > >  5  99530  99796
> > >  6  99617 100355
> > >  7 100490 100319
> > >  8  99793 100114
> > >  9 100426  99744
> > > 10 100315 100558
> > > 11  99279  99766
> > > 12  99572  99750
> > > 13  99955 100017
> > > 14 100413 15
> > > 15 100190 100052
> > > 16 101071 100195
> > > 17 100322 100224
> > > 18  99637  99540
> > > 19 100323  99251
> > > 20  99841 100177
> > > 21  99948  99504
> > > 22 100065 100031
> > > 23 100026  99827
> > > 24  99836  99818
> > > 25 100245  99822
> > > 26 100088  99678
> > > 27  99957  3
> > > 28 100065  99961
> > > 29 100701 100679
> > > 30  99756  99587
> > > 31 100220 100076
> > > 32 100067 15
> > > 33  99547  99984
> > > 34 100124 100031
> > > 35  99547 100661
> > > 36  99801  99963
> > > 37 100189 100230
> > > 38  99878  99579
> > > 39  99864  99442
> > > 40  99683 14
> > > 41  99907 100094
> > > 42 100291  99817
> > > 43  99908  99984
> > > 44 100044 100606
> > > 45 100065 100120
> > > 46  99358 100141
> > > 47 100152 100442
> > > 48 10 100279
> > > 49 100486  99848
> >
> > Sadly a sample run cannot be used to prove your implementation is
> > correct.  It can only be used to show it is not correct, like Philip
> > did.  To show your implementation produces uniform results in all
> > cases, you need to provide a solid argumentation that is easy to
> > follow. So far you failed to do that and I do not see it coming, given
> > the complexity of your implementation.  The current implementation
> > has a straightforward argumentation as it uses relatively simple
> > mathematical properties of modulo arithmetic.
> >
> > It is also clear your code (as it uses statics) is not thread safe.
> >
> > So to answer you original question clearly: no, we will not accept this.
> >
> > -Otto
> >
> -- 
> -Luke



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
I have a feeling that making it threadsafe under any quantity of threads
and still perform is likely not feasible, but if I could; or if making a
nonthreadsafe variant was possible:

Creating a mathematical proof that somehow is “simple and probably correct”
enough, would be an impossible task even for a PhD mathematician.

How did someone prove the current implementation was cryptographically
sound? Did they run massive simulations which ran the gamut of the uint32_t
range which demanded tight tolerances over varying length runs?

How was rc4 cipher proven bad for pseudorandom numbers? I’d be willing to
bet it wasn’t from a mathematical proof; it was from bad data.

I’m guessing that large upperbounds approaching 2**32 don’t feed very
soundly in the current implementation using a modulus; although I suspect
that there isn’t much of a call for such things, I’m pretty sure I saw a
3,000,000,000 upperbound in the src partition tonight.

On Sun, May 15, 2022 at 3:15 AM Otto Moerbeek  wrote:

> On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote:
>
> > This is version 1, which I was pretty sure was secure.
> >
> > I revamped it with a few features and implanted the binary search for
> 'bit'
> >
> > in most cases, which aren't intentionally worst-case, it's pretty darned
> > fast!
> >
> > This is a sample run of my program with your piece of code included:
> >
> >
> >  1  99319 100239
> >  2 100353  99584
> >  3 100032  99879
> >  4  99704 100229
> >  5  99530  99796
> >  6  99617 100355
> >  7 100490 100319
> >  8  99793 100114
> >  9 100426  99744
> > 10 100315 100558
> > 11  99279  99766
> > 12  99572  99750
> > 13  99955 100017
> > 14 100413 15
> > 15 100190 100052
> > 16 101071 100195
> > 17 100322 100224
> > 18  99637  99540
> > 19 100323  99251
> > 20  99841 100177
> > 21  99948  99504
> > 22 100065 100031
> > 23 100026  99827
> > 24  99836  99818
> > 25 100245  99822
> > 26 100088  99678
> > 27  99957  3
> > 28 100065  99961
> > 29 100701 100679
> > 30  99756  99587
> > 31 100220 100076
> > 32 100067 15
> > 33  99547  99984
> > 34 100124 100031
> > 35  99547 100661
> > 36  99801  99963
> > 37 100189 100230
> > 38  99878  99579
> > 39  99864  99442
> > 40  99683 14
> > 41  99907 100094
> > 42 100291  99817
> > 43  99908  99984
> > 44 100044 100606
> > 45 100065 100120
> > 46  99358 100141
> > 47 100152 100442
> > 48 10 100279
> > 49 100486  99848
>
> Sadly a sample run cannot be used to prove your implementation is
> correct.  It can only be used to show it is not correct, like Philip
> did.  To show your implementation produces uniform results in all
> cases, you need to provide a solid argumentation that is easy to
> follow. So far you failed to do that and I do not see it coming, given
> the complexity of your implementation.  The current implementation
> has a straightforward argumentation as it uses relatively simple
> mathematical properties of modulo arithmetic.
>
> It is also clear your code (as it uses statics) is not thread safe.
>
> So to answer you original question clearly: no, we will not accept this.
>
> -Otto
>
-- 
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Otto Moerbeek
On Sun, May 15, 2022 at 01:12:28AM -0500, Luke Small wrote:

> This is version 1, which I was pretty sure was secure.
> 
> I revamped it with a few features and implanted the binary search for 'bit'
> 
> in most cases, which aren't intentionally worst-case, it's pretty darned
> fast!
> 
> This is a sample run of my program with your piece of code included:
> 
> 
>  1  99319 100239
>  2 100353  99584
>  3 100032  99879
>  4  99704 100229
>  5  99530  99796
>  6  99617 100355
>  7 100490 100319
>  8  99793 100114
>  9 100426  99744
> 10 100315 100558
> 11  99279  99766
> 12  99572  99750
> 13  99955 100017
> 14 100413 15
> 15 100190 100052
> 16 101071 100195
> 17 100322 100224
> 18  99637  99540
> 19 100323  99251
> 20  99841 100177
> 21  99948  99504
> 22 100065 100031
> 23 100026  99827
> 24  99836  99818
> 25 100245  99822
> 26 100088  99678
> 27  99957  3
> 28 100065  99961
> 29 100701 100679
> 30  99756  99587
> 31 100220 100076
> 32 100067 15
> 33  99547  99984
> 34 100124 100031
> 35  99547 100661
> 36  99801  99963
> 37 100189 100230
> 38  99878  99579
> 39  99864  99442
> 40  99683 14
> 41  99907 100094
> 42 100291  99817
> 43  99908  99984
> 44 100044 100606
> 45 100065 100120
> 46  99358 100141
> 47 100152 100442
> 48 10 100279
> 49 100486  99848

Sadly a sample run cannot be used to prove your implementation is
correct.  It can only be used to show it is not correct, like Philip
did.  To show your implementation produces uniform results in all
cases, you need to provide a solid argumentation that is easy to
follow. So far you failed to do that and I do not see it coming, given
the complexity of your implementation.  The current implementation
has a straightforward argumentation as it uses relatively simple
mathematical properties of modulo arithmetic.

It is also clear your code (as it uses statics) is not thread safe.

So to answer you original question clearly: no, we will not accept this.

-Otto



Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Luke Small
This is version 1, which I was pretty sure was secure.

I revamped it with a few features and implanted the binary search for 'bit'

in most cases, which aren't intentionally worst-case, it's pretty darned
fast!

This is a sample run of my program with your piece of code included:


 1  99319 100239
 2 100353  99584
 3 100032  99879
 4  99704 100229
 5  99530  99796
 6  99617 100355
 7 100490 100319
 8  99793 100114
 9 100426  99744
10 100315 100558
11  99279  99766
12  99572  99750
13  99955 100017
14 100413 15
15 100190 100052
16 101071 100195
17 100322 100224
18  99637  99540
19 100323  99251
20  99841 100177
21  99948  99504
22 100065 100031
23 100026  99827
24  99836  99818
25 100245  99822
26 100088  99678
27  99957  3
28 100065  99961
29 100701 100679
30  99756  99587
31 100220 100076
32 100067 15
33  99547  99984
34 100124 100031
35  99547 100661
36  99801  99963
37 100189 100230
38  99878  99579
39  99864  99442
40  99683 14
41  99907 100094
42 100291  99817
43  99908  99984
44 100044 100606
45 100065 100120
46  99358 100141
47 100152 100442
48 10 100279
49 100486  99848








newdata
time = 0.240530757 seconds

newdata_fast
time = 0.073868626 seconds

The original takes 325.620% of the runtime of newdata_fast






newdataTypableFilename
time = 0.236748989 seconds

newdataTypableFilename_fast
time = 0.115842333 seconds

The original takes 204.372% of the runtime of newdataTypableFilename_fast






rand_ideal
time = 0.235832820 seconds

rand_ideal_fast
time = 0.025300798 seconds

The original takes 932.116% of the runtime of rand_ideal_fast






rand_short_ideal
time = 0.236828684 seconds

rand_short_ideal_fast
time = 0.124025922 seconds

The original takes 190.951% of the runtime of rand_short_ideal_fast






rand_short_worst
time = 0.237142869 seconds

rand_short_worst_fast
time = 0.294156486 seconds

The original takes 80.618% of the runtime of rand_short_worst_fast






rand_worst
time = 0.237002775 seconds

rand_worst_fast
time = 0.377148420 seconds

The original takes 62.841% of the runtime of rand_worst_fast






shuffle
time = 0.003044472 seconds

shuffle_fast
time = 0.002590664 seconds

The original takes 117.517% of the runtime of shuffle_fast

If it crashed here, you are trying to profile.
Turn off pledge at the beginning of main().



On Sat, May 14, 2022 at 7:03 PM Philip Guenther  wrote:

> On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> > Stuart Henderson wrote in
> ...
> >  |what's the perceived problem you're wanting to solve? and does that
> >  |problem actually exist in the first place?
> >
> > The problem is that if you have a low upper bound then modulo will "remove a
> > lot of randomization".  For example if you have a program which
> > generates Lotto numbers (1..49), then using _uniform() as it is will
> > generate many duplicates.
>
> Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform
> distribution.  Says right so in the manpage.  If an implementation of that
> API fails to do that, it's a broken implementation.
>
> The OpenBSD implementation achieves that.  NetBSD's implementation has the
> same core logic.  Here's my quick lotto pick demo program, doing 490
> picks, so they should all be somewhere near 10:
>
> #include <stdio.h>
> #include <stdlib.h>
> #define LIMIT   49
> int
> main(void)
> {
> unsigned counts[LIMIT] = { 0 };
> int i;
> for (i = 0; i < 10 * LIMIT; i++)
> counts[arc4random_uniform(LIMIT)]++;
> for (i = 0; i < LIMIT; i++)
> printf("%2d\t%7u\n", i+1, counts[i]);
> return 0;
> }
>
> And a sample run:
>
> : bleys; ./a.out
>
>  1   100100
>  2   100639
>  3    99965
>  4    99729
>  5    99641
>  6    99650
>  7   100299
>  8   100164
>  9    99791
> 10   100101
> 11   100657
> 12   100042
> 13    99661
> 14    99927
> 15    99426
> 16    99491
> 17    99646
> 18   100133
> 19   100013
> 20    99942
> 21    99873
> 22    99924
> 23    99567
> 24   100152
> 25   100688
> 26   100011
> 27   100481
> 28    99980
> 29   100406
> 30    99726
> 31    99808
> 32    99929
> 33   100050
> 34    99983
> 35   100048
> 36    99771
> 37    99906
> 38   100215
> 39   100261
> 40   100426
> 41    99847
> 42    99533
> 43   100368
> 44    99695
> 45   100041
> 46   100465
> 47    99875
> 48   100034
> 49    99920
> : bleys;
>
> Looks pretty good to me, with repeated runs showing the values bouncing
> around.
>
>
> ...
> > I was a bit interested when i saw Luke's message, but i am no
> > mathematician and, like i said, i in fact never needed the
> > functionality.  And the question i would have is how small
> > upper_bound has to be for the modulo problem to really kick in.
>
> If the implementation isn't broken, it's not a problem, period.
>
>
> > And even if, whether it 

Re: Picky, but much more efficient arc4random_uniform!

2022-05-15 Thread Otto Moerbeek
On Sat, May 14, 2022 at 05:03:17PM -0700, Philip Guenther wrote:

> On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> > Stuart Henderson wrote in
> ...
> >  |what's the perceived problem you're wanting to solve? and does that
> >  |problem actually exist in the first place?
> > 
> > The problem is that if you have a low upper bound then modulo will "remove a 
> > lot of randomization".  For example if you have a program which 
> > generates Lotto numbers (1..49), then using _uniform() as it is will 
> > generate many duplicates.
> 
> Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform 
> distribution.  Says right so in the manpage.  If an implementation of that 
> API fails to do that, it's a broken implementation.
> 
> The OpenBSD implementation achieves that.  NetBSD's implementation has the 
> same core logic.  Here's my quick lotto pick demo program, doing 490 
> picks, so they should all be somewhere near 10:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #define LIMIT   49
> int
> main(void)
> {
> unsigned counts[LIMIT] = { 0 };
> int i;
> for (i = 0; i < 10 * LIMIT; i++)
> counts[arc4random_uniform(LIMIT)]++;
> for (i = 0; i < LIMIT; i++)
> printf("%2d\t%7u\n", i+1, counts[i]);
> return 0;
> }
> 
> And a sample run:
> 
> : bleys; ./a.out  
>  
>  1   100100
>  2   100639
>  3    99965
>  4    99729
>  5    99641
>  6    99650
>  7   100299
>  8   100164
>  9    99791
> 10   100101
> 11   100657
> 12   100042
> 13    99661
> 14    99927
> 15    99426
> 16    99491
> 17    99646
> 18   100133
> 19   100013
> 20    99942
> 21    99873
> 22    99924
> 23    99567
> 24   100152
> 25   100688
> 26   100011
> 27   100481
> 28    99980
> 29   100406
> 30    99726
> 31    99808
> 32    99929
> 33   100050
> 34    99983
> 35   100048
> 36    99771
> 37    99906
> 38   100215
> 39   100261
> 40   100426
> 41    99847
> 42    99533
> 43   100368
> 44    99695
> 45   100041
> 46   100465
> 47    99875
> 48   100034
> 49    99920
> : bleys; 
> 
> Looks pretty good to me, with repeated runs showing the values bouncing 
> around.
> 
> 
> ...
> > I was a bit interested when i saw Luke's message, but i am no
> > mathematician and, like i said, i in fact never needed the
> > functionality.  And the question i would have is how small
> > upper_bound has to be for the modulo problem to really kick in.
> 
> If the implementation isn't broken, it's not a problem, period.
> 
> 
> > And even if, whether it matters.  For example, if you have
> > a bitset of 1024 "in-flight things", and the small upper bound
> > hits a used slot, you could simply search forward until you find
> > a free one.  I mean, with 1024 possible values the number of
> > possibilities is anyhow restricted.
> 
> Well hey, let's give Luke's a try and see how uniform it is:
> 
> 
> #include <stdio.h>
> #include <stdlib.h>
> #define LIMIT   49
> int
> main(void)
> {
> unsigned counts[LIMIT] = { 0 };
> unsigned counts2[LIMIT] = { 0 };
> int i;
> for (i = 0; i < 10 * LIMIT; i++) {
> counts[arc4random_uniform(LIMIT)]++;
> counts2[arc4random_uniform_fast_simple(LIMIT)]++;
> }
> for (i = 0; i < LIMIT; i++)
> printf("%2d\t%7u\t%7u\n", i+1, counts[i], counts2[i]);
> return 0;
> }
> 
> : bleys; ./a.out  
>  1   100188    76502
>  2    99983    76602
>  3   100521    76522
>  4   100416    76682
>  5   100171    76387
>  6   100163    76759
>  7   100024    76336
>  8       19    76703
>  9    99769    76237
> 10    99892    76532
> 11   100197    76730
> 12   100483    76398
> 13    99769    76310
> 14   100075    76474
> 15    99781    76599
> 16    99846    76439
> 17    99814    76430
> 18   100313    76648
> 19   100259    76813
> 20    99885    77068
> 21   100302    76546
> 22        8    76698
> 23    99491    76678
> 24   100340    76324
> 25    99763   115263
> 26    99872   153008
> 27   100022   152979
> 28    99481   153793
> 29   100018   210714
> 30    99617   229286
> 31   100167   297003
> 32   100270   449664
> 33   100468    76790
> 34    99115    76452
> 35    99921    76392
> 36    99862    76140
> 37   100485    76607
> 38   100029    75885
> 39    99577    76498
> 40    99479    76727
> 41   100139    76746
> 42   100883    76698
> 43   100102    76474
> 44    99801    76592
> 45   100117    76124
> 46    99678    76417
> 47    99770    76639
> 48    99524    77034
> 49   100151    76658
> : bleys; 
> 
> Wow, that last column is *bad*.  

Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Philip Guenther
On Sun, 15 May 2022, Steffen Nurpmeso wrote:
> Stuart Henderson wrote in
...
>  |what's the perceived problem you're wanting to solve? and does that
>  |problem actually exist in the first place?
> 
> The problem is that if you have a low upper bound then modulo will "remove a 
> lot of randomization".  For example if you have a program which 
> generates Lotto numbers (1..49), then using _uniform() as it is will 
> generate many duplicates.

Wut.  The *WHOLE POINT* of arc4random_uniform() is that it has uniform 
distribution.  Says right so in the manpage.  If an implementation of that 
API fails to do that, it's a broken implementation.

The OpenBSD implementation achieves that.  NetBSD's implementation has the 
same core logic.  Here's my quick lotto pick demo program, doing 4,900,000 
picks, so the counts should all be somewhere near 100000:

#include <stdio.h>
#include <stdlib.h>

#define LIMIT   49

int
main(void)
{
	unsigned counts[LIMIT] = { 0 };
	int i;

	for (i = 0; i < 100000 * LIMIT; i++)
		counts[arc4random_uniform(LIMIT)]++;
	for (i = 0; i < LIMIT; i++)
		printf("%2d\t%7u\n", i+1, counts[i]);
	return 0;
}

And a sample run:

: bleys; ./a.out
 1   100100
 2   100639
 3    99965
 4    99729
 5    99641
 6    99650
 7   100299
 8   100164
 9    99791
10   100101
11   100657
12   100042
13    99661
14    99927
15    99426
16    99491
17    99646
18   100133
19   100013
20    99942
21    99873
22    99924
23    99567
24   100152
25   100688
26   100011
27   100481
28    99980
29   100406
30    99726
31    99808
32    99929
33   100050
34    99983
35   100048
36    99771
37    99906
38   100215
39   100261
40   100426
41    99847
42    99533
43   100368
44    99695
45   100041
46   100465
47    99875
48   100034
49    99920
: bleys; 

Looks pretty good to me, with repeated runs showing the values bouncing around.


...
> I was a bit interested when i saw Luke's message, but i am no
> mathematician and, like i said, i in fact never needed the
> functionality.  And the question i would have is how small
> upper_bound has to be for the modulo problem to really kick in.

If the implementation isn't broken, it's not a problem, period.
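For readers wondering what "not broken" looks like: the standard rejection technique a correct arc4random_uniform()-style function uses can be sketched at 8-bit scale, where every possible draw is small enough to enumerate. This is an illustration of the idea only, not the actual library source:

```c
#include <stdint.h>

/* Scaled-down sketch (8 bits instead of 32) of the rejection idea:
 * throw away the first (2^8 mod upper) possible draws, then reduce
 * the survivors mod upper.  The survivors split into exactly equal
 * residue classes, so the result is uniform.  Returns 1 and stores
 * the result if the draw is accepted, 0 if the caller must redraw. */
static int
uniform8_try(uint8_t draw, unsigned upper, uint8_t *out)
{
	unsigned min = 256 % upper;	/* size of the biased region */

	if (draw < min)
		return 0;		/* reject: caller draws again */
	*out = draw % upper;		/* accepted draws are uniform */
	return 1;
}
```

Enumerating all 256 possible draws for upper == 3 accepts 255 of them, exactly 85 per outcome.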


> And even if, whether it matters.  For example, if you have
> a bitset of 1024 "in-flight things", and the small upper bound
> hits a used slot, you could simply search forward until you find
> a free one.  I mean, with 1024 possible values the number of
> possibilities is anyhow restricted.

Well hey, let's give Luke's a try and see how uniform it is:


#include <stdio.h>
#include <stdlib.h>

#define LIMIT   49

int
main(void)
{
	unsigned counts[LIMIT] = { 0 };
	unsigned counts2[LIMIT] = { 0 };
	int i;

	for (i = 0; i < 100000 * LIMIT; i++) {
		counts[arc4random_uniform(LIMIT)]++;
		counts2[arc4random_uniform_fast_simple(LIMIT)]++;
	}
	for (i = 0; i < LIMIT; i++)
		printf("%2d\t%7u\t%7u\n", i+1, counts[i], counts2[i]);
	return 0;
}

: bleys; ./a.out  
 1   100188   76502
 2    99983   76602
 3   100521   76522
 4   100416   76682
 5   100171   76387
 6   100163   76759
 7   100024   76336
 8       19   76703
 9    99769   76237
10    99892   76532
11   100197   76730
12   100483   76398
13    99769   76310
14   100075   76474
15    99781   76599
16    99846   76439
17    99814   76430
18   100313   76648
19   100259   76813
20    99885   77068
21   100302   76546
22        8   76698
23    99491   76678
24   100340   76324
25    99763  115263
26    99872  153008
27   100022  152979
28    99481  153793
29   100018  210714
30    99617  229286
31   100167  297003
32   100270  449664
33   100468   76790
34    99115   76452
35    99921   76392
36    99862   76140
37   100485   76607
38   100029   75885
39    99577   76498
40    99479   76727
41   100139   76746
42   100883   76698
43   100102   76474
44    99801   76592
45   100117   76124
46    99678   76417
47    99770   76639
48    99524   77034
49   100151   76658
: bleys; 

Wow, that last column is *bad*.  Repeated runs show 25-32 are consistently 
high, 32 being *outrageously* off, and the others are low.

Luke's implementation does not correctly implement the API.  Doesn't 
matter if it's a million times faster when it doesn't deliver the goal.


Philip Guenther



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Steffen Nurpmeso
Stuart Henderson wrote:
 |On 2022/05/14 06:56, Luke Small wrote:
 |> If I use arc4random_uniform() repeatedly to create a random distribution \
 |> of
 |> say numbers less than 0x1000 or even something weird like 0x1300 will the
 |> random distribution be better with arc4random_uniform() or with mine?
 |
 |there's no point to have a choice of different arc4random_uniform_xyz
 |with different characteristics, how is somebody going to pick the
 |"right" one?
 |
 |the bottom of netbsd's version of the arc4random(3) manual says it well:
 |
 | One may be tempted to create new APIs to accommodate different \
 | security
 | models and performance constraints without unpleasant surprises \
 | on dif-
 | ferent operating systems.  This should not be done lightly, though,
 | because there are already too many different choices, and too \
 | many oppor-
 | tunities for programmers to reach for one and pick the wrong one.
 |
 |what's the perceived problem you're wanting to solve? and does that
 |problem actually exist in the first place?

The problem is that if you have a low upper bound then modulo will
"remove a lot of randomization".  For example if you have
a program which generates Lotto numbers (1..49), then using
_uniform() as it is will generate many duplicates.

I used the approach i found in a TeX file over twenty years ago
("RANDOM.TEX, v.1 (Donald Arseneau) from TeTeX,
texmf/tex/generic/misc/random.tex"; and now that i look, he
published a .v2 not too long ago), it was named _clip() as it
produces something in between a minimum and a maximum:

_max -= _min;
++_max;
_min = Maximum::sir - 2; // m - 2 = 2^(32|64) - 3
if (_max != 0)
_min /= _max;

for(;;) {
uir i;

i = _random;
i += M1::uir;
if(_min != 0)
i /= _min;

if(i >= _max && (i != 0 || _max != 0)) {
_random *= 16807; // just MUL
if(_random == 0)
_random = 1;
continue;
}
_random = i;
break;
}

_random += origmin;
_Assert(1 && _random >= origmin && _random <= origmax);

That is the random number is multiplied with a, eh, non-prime, as
long as we do not fit.
However i found this became a terrible mess with 64-bit numbers,
so i removed the function, aka did not "port it over" when
i implemented a random number generator for my own use cases.
Maybe i should have increased the multiplicator.

It is really, really bad.  :-)

I only ever used it for a Lotto number program anyhow, i think,
and so i thought it would be better to simply read a one byte
random and skip those that do not fit, instead of making them
always fit with modulo 50.  That is, application specific
workaround.  Things are less harmful for large numbers, anyway.
In fact i have to say i was astonished to see someone using this
function, not too long ago i saw a FreeBSD commit fly by which
uses it!
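The one-byte skip-and-retry workaround described above can be sketched like this (getbyte is a hypothetical injectable byte source, added here so the sketch is testable; a real program would back it with arc4random()):

```c
#include <stdint.h>

/* Draw a Lotto number in 1..49 by plain byte rejection: pull one
 * random byte and retry unless it falls in 0..48.  Unbiased, though
 * it rejects 207 of every 256 bytes; accepting bytes below 245
 * (5 * 49) and reducing mod 49 would reject far fewer. */
static int
lotto_pick(uint8_t (*getbyte)(void))
{
	for (;;) {
		uint8_t b = getbyte();

		if (b < 49)
			return b + 1;
	}
}

/* deterministic stand-in byte source, used only to exercise the sketch */
static const uint8_t demo_bytes[] = { 200, 255, 48, 7 };
static unsigned demo_pos;

static uint8_t
demo_getbyte(void)
{
	return demo_bytes[demo_pos++];
}
```

With the stand-in stream above, the first call skips 200 and 255 and returns 49; the next returns 8.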

I was a bit interested when i saw Luke's message, but i am no
mathematician and, like i said, i in fact never needed the
functionality.  And the question i would have is how small
upper_bound has to be for the modulo problem to really kick in.
And even if, whether it matters.  For example, if you have
a bitset of 1024 "in-flight things", and the small upper bound
hits a used slot, you could simply search forward until you find
a free one.  I mean, with 1024 possible values the number of
possibilities is anyhow restricted.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Matthew Martin
int
main() {
	int results[3] = { 0, 0, 0 };
	for (int i = 0; i < 100000; i++) {
		results[arc4random_uniform_fast_simple(3)]++;
	}
	for (int i = 0; i < 3; i++)
		printf("%d: %d\n", i, results[i]);

	return 0;
}

% ./a.out
0: 24809
1: 50011
2: 25180

You can't reuse bits because they'll be biased.
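The bias can be confirmed without sampling at all. The sketch below (a reconstruction of the "slide one bit" scheme from this thread's description, not Luke's actual code) enumerates every equally likely 10-bit stream for upper bound 3; because the kept bit is always 1 after a rejection, the retried window can only ever be 1 or 3, so outcome 1 absorbs all retries:

```c
/* Exhaustively run the sliding-bit-window scheme on all 1024 equally
 * likely 10-bit streams, upper bound 3: take a 2-bit window; if it is
 * 3 (the only reject value), shift right and pull one fresh stream
 * bit in as the new high bit.  Tallies come out 256 / 511 / 256 with
 * one stream unresolved -- probabilities 1/4, ~1/2, 1/4. */
static void
sliding_window_counts(unsigned counts[3], unsigned *unfinished)
{
	unsigned stream;

	counts[0] = counts[1] = counts[2] = 0;
	*unfinished = 0;
	for (stream = 0; stream < 1024; stream++) {
		unsigned pos = 2;
		unsigned window = stream & 3;	/* first two bits */

		while (window == 3 && pos < 10) {
			window = (window >> 1) |
			    (((stream >> pos) & 1) << 1);
			pos++;
		}
		if (window < 3)
			counts[window]++;
		else
			(*unfinished)++;	/* ran out of bits */
	}
}
```

That 25/50/25 split is exactly what the empirical run above shows.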



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Stuart Henderson
On 2022/05/14 06:56, Luke Small wrote:
> If I use arc4random_uniform() repeatedly to create a random distribution of
> say numbers less than 0x1000 or even something weird like 0x1300 will the
> random distribution be better with arc4random_uniform() or with mine?

there's no point to have a choice of different arc4random_uniform_xyz
with different characteristics, how is somebody going to pick the
"right" one?

the bottom of netbsd's version of the arc4random(3) manual says it well:

 One may be tempted to create new APIs to accommodate different security
 models and performance constraints without unpleasant surprises on dif-
 ferent operating systems.  This should not be done lightly, though,
 because there are already too many different choices, and too many oppor-
 tunities for programmers to reach for one and pick the wrong one.

what's the perceived problem you're wanting to solve? and does that
problem actually exist in the first place?



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Luke Small
Look at my code. I don’t even use a modulus operator. I perform hit and
miss with a random bitstream.

How can I have a bias of something I don’t do? I return a bitstream which
meets the parameters of being a value less than the upper bound. Much like
arc4random_buf().

If I use arc4random_uniform() repeatedly to create a random distribution of
say numbers less than 0x1000 or even something weird like 0x1300 will the
random distribution be better with arc4random_uniform() or with mine? For
0x1000 mine will simply pluck 12 bits of random data straight from the
arc4random() (and preserve the remaining 20 bits for later) on the first
try, just as arc4random_buf() would.

arc4random_uniform() will perform a modulus of a 32 bit number which adds
data to the bitstream. Does it make it better? Perhaps it makes it harder
to guess the source bits.

I don’t know; and I’m not going to pretend to be a cryptologist. But I’m
looking at modulo bias.

I didn’t know what it was, before, but I basically “rejection sample”:

https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/
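To make the bias being discussed concrete, here is a small sketch (not code from this thread) that enumerates every byte value and reduces it mod 3; one outcome collects an extra count, and that skew is exactly what rejection sampling removes:

```c
/* Count how often each outcome of (byte % upper) occurs over all 256
 * equally likely byte values.  With upper == 3, outcome 0 appears 86
 * times and outcomes 1 and 2 only 85 times each -- a small skew that
 * grows as upper approaches 256. */
static void
mod_bias_counts(unsigned upper, unsigned counts[])
{
	unsigned d;

	for (d = 0; d < upper; d++)
		counts[d] = 0;
	for (d = 0; d < 256; d++)
		counts[d % upper]++;
}
```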

On Sat, May 14, 2022 at 6:14 AM Otto Moerbeek wrote:

> On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote:
>
> > arc4random_uniform_fast2 that I made, streams in data from arc4random()
> and
> > uses the datastream directly and uses it as a bit by bit right "sliding
> > window" in the last loop. arc4random_uniform() uses a modulus which is
> > simple to implement, but I wonder how cryptographically sound or even how
> > evenly it distributes. Adding a modulus seems sloppy without something
> > better. I did make arc4random_fast_simple() which merely takes an
> > upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
> > the top function was into it which binary searches to find the correct
> size
> > bitfield (return value) needed to barely fit the upperbound while also
> > being able to discover every possible value below the upperbound. It
> isn't
> > as fast as arc4random_uniform_fast2 if it were used repeatedly after a
> > single use of arc4random_uniform_fast_bitsearch() , but it does exactly
> the
> > same thing and appears faster than repeatedly using arc4random_uniform()
> > and its wasteful use of arc4random() and calling the expensive rekeying
> > function more often.
> >
> > It may be interesting to determine even without looking at performance,
> > whether arc4random_fast_simple() creates a more superior, secure use of
> the
> > chacha20 stream than arc4random_uniform() with the modulus. what exactly
> > does all that extra data from the modulus do to the random distribution?
> >
> > -Luke
>
> I don't follow you at all. Your blabbering does not even use the terms
> "uniform" and "modulo bias". I wonder even if you realize what they
> mean in this context.
>
> -Otto
>
> --
-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Otto Moerbeek
On Sat, May 14, 2022 at 05:48:10AM -0500, Luke Small wrote:

> arc4random_uniform_fast2 that I made, streams in data from arc4random() and
> uses the datastream directly and uses it as a bit by bit right "sliding
> window" in the last loop. arc4random_uniform() uses a modulus which is
> simple to implement, but I wonder how cryptographically sound or even how
> evenly it distributes. Adding a modulus seems sloppy without something
> better. I did make arc4random_fast_simple() which merely takes an
> upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
> the top function was into it which binary searches to find the correct size
> bitfield (return value) needed to barely fit the upperbound while also
> being able to discover every possible value below the upperbound. It isn't
> as fast as arc4random_uniform_fast2 if it were used repeatedly after a
> single use of arc4random_uniform_fast_bitsearch() , but it does exactly the
> same thing and appears faster than repeatedly using arc4random_uniform()
> and its wasteful use of arc4random() and calling the expensive rekeying
> function more often.
> 
> It may be interesting to determine even without looking at performance,
> whether arc4random_fast_simple() creates a more superior, secure use of the
> chacha20 stream than arc4random_uniform() with the modulus. what exactly
> does all that extra data from the modulus do to the random distribution?
> 
> -Luke

I don't follow you at all. Your blabbering does not even use the terms
"uniform" and "modulo bias". I wonder even if you realize what they
mean in this context.

-Otto



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Luke Small
arc4random_uniform_fast2 that I made, streams in data from arc4random() and
uses the datastream directly and uses it as a bit by bit right "sliding
window" in the last loop. arc4random_uniform() uses a modulus which is
simple to implement, but I wonder how cryptographically sound or even how
evenly it distributes. Adding a modulus seems sloppy without something
better. I did make arc4random_fast_simple() which merely takes an
upperbound. I integrated arc4random_uniform_fast_bitsearch() or whatever
the top function was into it which binary searches to find the correct size
bitfield (return value) needed to barely fit the upperbound while also
being able to discover every possible value below the upperbound. It isn't
as fast as arc4random_uniform_fast2 if it were used repeatedly after a
single use of arc4random_uniform_fast_bitsearch() , but it does exactly the
same thing and appears faster than repeatedly using arc4random_uniform()
and its wasteful use of arc4random() and calling the expensive rekeying
function more often.

It may be interesting to determine even without looking at performance,
whether arc4random_fast_simple() creates a more superior, secure use of the
chacha20 stream than arc4random_uniform() with the modulus. what exactly
does all that extra data from the modulus do to the random distribution?

-Luke


Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Otto Moerbeek
In general, wasting some arc4random bits is not a problem at all.

I see a lot of code with no demonstration, explanation or proof why it
would be correct (i.e. produces uniform results). You only talk about
speed.

The current code might not be optimal, but at least it is very clear
it's correct.

-Otto


On Fri, May 13, 2022 at 09:43:26AM -0500, Luke Small wrote:

> I made a couple new versions of a new kind of arc4random_uniform-like
> function and some example functions which use them. Instead of having a
> sufficiently large random number greater than the modulus, I pick a random
> number using arc4random() from a bitfield where the length of the bitfield
> is just below or slightly beyond the value of the modulus and returns the
> bitfield it if it is less than the value of the modulus.
> 
> In both versions, I retrieve a bitfield from a static uint64_t which is
> refreshed from periodic arc4random() calls. The functions demand a bit
> length. but I provide a convenient bitsize search function:
> arc4random_uniform_fast_bitsearch()
> 
> in the first version, if the chosen return value isn't less than the
> modulus, the entire bitfield is trashed and a completely new bitfield is
> refreshed from the cache. This can be very inefficient and for certain
> upperbounds where the functions demand that all returned values less than
> the upperbound are possible. This can perform worse than
> arc4random_uniform() even with more random number generation churn.
> 
> in the second version, if the chosen return value isn't less than the
> modulus, the bitfield is shifted right, losing the least significant bit
> and a new random-value bit from the least significant bit of the cache is
> put into the most significant bit of the bitfield and it tries again. For
> significant expenditures of effort, this version is always more efficient
> than arc4random_uniform() and slightly to much more efficient than my first
> version.
> 
> 
> The performance boost comes from less pseudorandom number generation by not
> trashing perfectly good random data and preserving it so that rekeying is
> performed less.
> 
> 
> I suspect that the first version may be secure, but I'm not sure what you
> think of the second version, for being secure. Is there a way to test how
> well distributed and secure it is?!
> 
> Perhaps it's even better than the typical use case of arc4random_uniform
> since it feeds it a bitstream instead of performing a modulus operation on
> it!
> 
> I pledge("stdio") and unveil(NULL, NULL) at the beginning, so you know it
> doesn't do anything but malloc() an array, do stuff on the array and print
> stuff; if you don't profile, anyway.
> 
> What do you think of having such a thing in OpenBSD?
> 
> 
> 
> 
> newdata() generates random a-z characters performing arc4random_uniform(26);
> 
> newdataTypableFilename() generates random filename typable characters
> performing arc4random_uniform(92);
> 
> I perform other tests including presumed worst-cases on large numbers and
> numbers which will have a lot of misses.
> 
> 
> 
> 
> -Luke




Picky, but much more efficient arc4random_uniform!

2022-05-13 Thread Luke Small
I made a couple new versions of a new kind of arc4random_uniform-like
function and some example functions which use them. Instead of having a
sufficiently large random number greater than the modulus, I pick a random
number using arc4random() from a bitfield where the length of the bitfield
is just below or slightly beyond the value of the modulus and returns the
bitfield it if it is less than the value of the modulus.

In both versions, I retrieve a bitfield from a static uint64_t which is
refreshed from periodic arc4random() calls. The functions demand a bit
length. but I provide a convenient bitsize search function:
arc4random_uniform_fast_bitsearch()

in the first version, if the chosen return value isn't less than the
modulus, the entire bitfield is trashed and a completely new bitfield is
refreshed from the cache. This can be very inefficient and for certain
upperbounds where the functions demand that all returned values less than
the upperbound are possible. This can perform worse than
arc4random_uniform() even with more random number generation churn.

in the second version, if the chosen return value isn't less than the
modulus, the bitfield is shifted right, losing the least significant bit
and a new random-value bit from the least significant bit of the cache is
put into the most significant bit of the bitfield and it tries again. For
significant expenditures of effort, this version is always more efficient
than arc4random_uniform() and slightly to much more efficient than my first
version.


The performance boost comes from less pseudorandom number generation by not
trashing perfectly good random data and preserving it so that rekeying is
performed less.


I suspect that the first version may be secure, but I'm not sure what you
think of the second version, for being secure. Is there a way to test how
well distributed and secure it is?!
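One standard first answer to the distribution question is a chi-squared goodness-of-fit statistic over the observed counts. The sketch below is a minimal illustration, not a full test; a real check would compare the statistic against the chi-squared critical value for k-1 degrees of freedom, or use a dedicated tool such as ent or dieharder for deeper testing:

```c
/* Chi-squared statistic for k outcomes that should be equally likely:
 * sum over outcomes of (observed - expected)^2 / expected.  Under
 * uniformity the statistic averages about k-1; values far above that
 * suggest the generator is biased. */
static double
chi_squared(const unsigned counts[], unsigned k)
{
	unsigned long total = 0;
	double expected, chi2 = 0.0;
	unsigned i;

	for (i = 0; i < k; i++)
		total += counts[i];
	expected = (double)total / k;
	for (i = 0; i < k; i++) {
		double d = (double)counts[i] - expected;

		chi2 += d * d / expected;
	}
	return chi2;
}
```

For example, perfectly even counts give a statistic of 0, while counts of {30, 20, 25, 25} over four outcomes give 2.0.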

Perhaps it's even better than the typical use case of arc4random_uniform
since it feeds it a bitstream instead of performing a modulus operation on
it!

I pledge("stdio") and unveil(NULL, NULL) at the beginning, so you know it
doesn't do anything but malloc() an array, do stuff on the array and print
stuff; if you don't profile, anyway.

What do you think of having such a thing in OpenBSD?




newdata() generates random a-z characters performing arc4random_uniform(26);

newdataTypableFilename() generates random filename typable characters
performing arc4random_uniform(92);

I perform other tests including presumed worst-cases on large numbers and
numbers which will have a lot of misses.




-Luke


arc4random_uniform_fast.c
Description: Binary data