Doug Cutting expressed a concern to me about using util.Random to generate
random 64 bit block numbers for NDFS. The following is my analysis.
Random number generators for NDFS block numbers
Requirements
capable of billions of block numbers
64 bit block numbers
deterministic for reproducible testing
ideally, want a generator that makes no repeats and is serializable.
Concern
Currently using built-in util.Random, is it good enough?
Background
Theory of hashing
In a hashing system, the collision rate is calculated just like the
birthday problem/paradox.
rule of thumb is to have twice as many bits as number of bits needed
for the total number of items.
example: 4 bits needed to count 16 items, needs 8 bit hashes.
In a non-hashing system, block numbers are not related to block contents,
but the birthday problem still applies.
Theory of random numbers
The amount of entropy that a PRNG instance has cannot be increased by
concatenation.
A full 64 bits are needed to to store 4 billion items.
Diehard test suite is a good randomness benchmark.
Analysis
util.Random is a linear congruential generator (LCG) identical to drand48.
util.Random keeps 48 bits of state and gangs together 2 consecutive
values to return 64 bit values.
LCGs suffer from periodicity in the low order bits which would make
modular binning less than random.
"low order bits" could mean least significant byte.
LCGs have periods in the range 10^6 to 10^9 when using 32 bit words, a
range of poor to fair.
seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of
time.
Results of the Birthday Spacings Test look good.
References
http://www.math.utah.edu/~beebe/java/random/README
http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
Alternative
MersenneTwister replacement for Random is best choice, has a
mind-bogglingly long period.
Matsumoto and Nishimura (1998)
Longest period of any known generator 2^19937 or about 10^6001
A period that exceeds the number of unique values seems ideal
(intuitively ideal; obviously a shorter period than the number of
unique values is a problem).
Faster than standard Random.
Excellent result for Diehard Birthday Spacings Test.
Java Implementations.
All the implementation are derived from the free to use C source.
BSD type license
Sean Luke implementation
GNU Library General Public License
Brian O. Bush implementation
public domain
Michael L. Brundage implementation