On Mon, Jun 26, 2017 at 8:57 PM, Steve Tolkin <[email protected]> wrote:
> After a hiatus of 10+ years from writing any programs I am now writing some
> programs to generate synthetic test data for a performance bakeoff between
> two (or more) different systems.  So we want exactly the same data on each
> system.

Seems reasonable

> One principle I am following is never to call a random number generator.

A good plan.
#include <von-neuman/state_of_sin>

OTOH seeding a known standard cryptographic hashing algorithm (or
other quality standardized but not crypto pRNG) with a specified seed
value and iterating will generate a deterministic (reproducible)
sequence of random words of bits, which is not platform dependent.
Crypto libraries provide what you need, no development necessary,
already interoperability tested.

(But you can't use the cRNG of the friendliest cypto libraries because
to avoid stupid uses, they prevent setting the seed of their random
oracle and seed and reseed with real collected entropy. They're not
quite true RNG's because they stretch the received entropy with
cryptographically strong mixing, but they're no longer deterministic.)

> The following subroutine deterministic_shuffle uses some of the digits in
> fractional part of the log of a number in base e as a pseudo-random number.

> I have forgotten most of the Perl I once knew -- even some elementary stuff
> like array references.

Lack of practice will do that ... you should come to tech meetings !

> I am looking for two different kinds of feedback:
> 1. How to improve the code.  It is not to be more "perlish" or "perly", but

ok

> better from a software engineering perspective.  In fact the less perlish
> the better, because it is possible this algorithm may need to be implemented
> in other languages.

see above comment about portable QA'd deterministic RNGs off the shelf ?

> 2. How to improve the mathematics used. I think these digits will be random
> enough for my purposes.

Rule 1. Never write your own RNG or Cypto primitives.
Rule 2. See rule 1.
Rule 3. Seek qualified help if selecting and using quality crypto
primitives not obvious.
Rule 4. Don't use floating point if you care about correctness /
reproducibility.

If you try to get entropy out of 'e' or 'pi', you really should be
using Bignums and a spigot algorithm to emit digits (or bytes),

>  See especially the comment starting:    # Maybe
> later have this multiplier be something like 10 * $array_size

If your shuffle gets the same answer on platforms using something
other than Clib's log() with this multiplier, i'll be pleasantly
surprised. It's discarding the significant digits of the
approximation.


Deterministic randomness is mind-bending.
   You likely care more about the reproducibility than the
high-dimensional spectral purity of of your random numbers.
   A really bad random number generator that will not overflow or
underflow in any normal language is better than something clever like
your use of log() which is far less portable in the randomish bits
than you'd hope.
   If you want mulitdimensional spectral purity and determinism, then
iterating a cryptographic hash is a good plan.

e.g.,
# bash ...

x=00000000000000000000000000000000000000000000000000000000;
y=$x; while true; do y=$(echo $y | sha256sum --tag  | cut -f 4 -d ' '
); echo $y; done | head -40

should produce the same answer on any system ... repeatable ...

[Explanation : This is re-hashing the lower-case hex expansion of the
prior 256 bit value.
   NB: For cryptographic use, we wouldn't want the 4bit to 8bit
expansion to be limited to 0x0 ..0xf => [0-9a-f] as that's a bit too
predictable; e.g, all bytes representing binary nibbles are 7-bit pure
ASCII and either LC alpha or numeric so the 2^5 bit is always on, but
the hash should bury that for your purposes.]

Or one could take feedback un-expanded in binary, or use a different
cryptographic mixing expansion, e.g. encrypt a several blocks with
state as key, to produce the next hash input.

Or, for best spectral purity of RNG output, a block-cipher such in
Counter mode (CTR or CM) will mix thoroughly.
https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29

Stream Counter mode is one mode where NACL library will allow the user
to specify both key and nonce, so that a repeatably seeded simulation
is possible.  AES-128 is one of the available.

"Networking and Cryptography library" NaCl is available either as
libsodium library bindings or native implementation for most
languages.
https://en.wikipedia.org/wiki/NaCl_(software)
( NaCl was launched by DJB, so it's obviously the good stuff. )
Perl https://metacpan.org/pod/Crypt::NaCl::Sodium
Java https://github.com/neilalexander/jnacl &
https://github.com/abstractj/kalium


-- 
Bill Ricker
[email protected]
https://www.linkedin.com/in/n1vux

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

Reply via email to