Doug Cutting expressed a concern to me about using util.Random to generate
random 64 bit block numbers for NDFS.  The following is my analysis.

Random number generators for NDFS block numbers
  Requirements
    capable of billions of block numbers
    64 bit block numbers
    deterministic for reproducible testing
    ideally, want a generator that makes no repeats and is serializable.
  Concern
    Currently using built-in util.Random, is it good enough?
  Background
    Theory of hashing
      In a hashing system, the collision rate is calculated just like the 
birthday problem/paradox.
        rule of thumb is to have twice as many bits as number of bits needed 
for the total number of items.
          example: 4 bits needed to count 16 items, needs 8 bit hashes.
      In a non-hashing system, block numbers are not related to block contents, 
but the birthday problem still applies.
    Theory of  random numbers
      The amount of entropy that a PRNG instance has cannot be increased by 
concatenation.
        A full 64 bits are needed to to store 4 billion items.
      Diehard test suite is a good randomness benchmark.
  Analysis
    util.Random is a linear congruential generator (LCG) identical to drand48.
      util.Random keeps 48 bits of state and gangs together 2 consecutive 
values to return 64 bit values.
      LCGs suffer from periodicity in the low order bits which would make 
modular binning less than random.
        "low order bits" could mean least significant byte.
      LCGs have periods in the range 10^6 to 10^9 when using 32 bit words, a 
range of poor to fair.
      seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
        the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of 
time.
      Results of the Birthday Spacings Test look good.
     References
      http://www.math.utah.edu/~beebe/java/random/README
      http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
  Alternative
    MersenneTwister replacement for Random is best choice, has a 
mind-bogglingly long period.
      Matsumoto and Nishimura (1998)
      Longest period of any known generator 2^19937 or about 10^6001
      A period that exceeds the number of unique values seems ideal
        (intuitively ideal; obviously a shorter period than the number of 
unique values is a problem).
      Faster than standard Random.
      Excellent result for Diehard Birthday Spacings Test.
      Java Implementations.
        All the implementation are derived from the free to use C source.
        BSD type license
          Sean Luke implementation
        GNU Library General Public License
          Brian O. Bush implementation
        public domain
          Michael L. Brundage implementation

Reply via email to