Doug Cutting expressed a concern to me about using util.Random to generate random 64 bit block numbers for NDFS. The following is my analysis.
Random number generators for NDFS block numbers Requirements capable of billions of block numbers 64 bit block numbers deterministic for reproducible testing ideally, want a generator that makes no repeats and is serializable. Concern Currently using built-in util.Random, is it good enough? Background Theory of hashing In a hashing system, the collision rate is calculated just like the birthday problem/paradox. rule of thumb is to have twice as many bits as number of bits needed for the total number of items. example: 4 bits needed to count 16 items, needs 8 bit hashes. In a non-hashing system, block numbers are not related to block contents, but the birthday problem still applies. Theory of random numbers The amount of entropy that a PRNG instance has cannot be increased by concatenation. A full 64 bits are needed to to store 4 billion items. Diehard test suite is a good randomness benchmark. Analysis util.Random is a linear congruential generator (LCG) identical to drand48. util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64 bit values. LCGs suffer from periodicity in the low order bits which would make modular binning less than random. "low order bits" could mean least significant byte. LCGs have periods in the range 10^6 to 10^9 when using 32 bit words, a range of poor to fair. seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1); the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time. Results of the Birthday Spacings Test look good. References http://www.math.utah.edu/~beebe/java/random/README http://www.pierssen.com/arcview/upload/esoterica/randomizer.html Alternative MersenneTwister replacement for Random is best choice, has a mind-bogglingly long period. Matsumoto and Nishimura (1998) Longest period of any known generator 2^19937 or about 10^6001 A period that exceeds the number of unique values seems ideal (intuitively ideal; obviously a shorter period than the number of unique values is a problem). Faster than standard Random. Excellent result for Diehard Birthday Spacings Test. Java Implementations. All the implementation are derived from the free to use C source. BSD type license Sean Luke implementation GNU Library General Public License Brian O. Bush implementation public domain Michael L. Brundage implementation