block numbers need a better random number generator
---------------------------------------------------
Key: NUTCH-122
URL: http://issues.apache.org/jira/browse/NUTCH-122
Project: Nutch
Type: Bug
Components: fetcher, indexer, searcher
Versions: 0.8-dev
Reporter: Paul Baclace
In order to support billions of block numbers, a better PRNG than
java.util.Random is needed. To reach billions with low probability of
collision, 64 bit random numbers are needed (the Birthday Problem is the model
for the number of bits needed; the result is that twice as many bits are needed
as the number of bits to count the expected number of items.) The built-in
java.util.Random keeps only 48 bits of state which is only sufficient for 2^24
items. Using repeated calls to or more than one instance of Random does not
increase its total entropy.
Analysis
util.Random is a linear congruential generator (LCG) identical to drand48.
util.Random keeps 48 bits of state and gangs together 2 consecutive
values to return 64 bit values.
LCGs suffer from periodicity in the low order bits which would make
modular binning less than random.
"low order bits" could mean least significant byte.
LCGs have periods in the range 106 to 109 when using 32 bit words, a
range of poor to fair.
seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of
time.
Results of the Birthday Spacings Test look good.
References
http://www.math.utah.edu/~beebe/java/random/README
http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
Recommended alternative: MersenneTwister
Matsumoto and Nishimura (1998).
Longest period of any known generator 2^19937 or about 10^6001.
A period that exceeds the number of unique values seems ideal; obviously
a shorter period than the number of unique values (like util.Random) is a
problem).
Faster than java.util.Random (Random was recent tweaked, however).
Excellent result for Diehard Birthday Spacings Test.
Can be seeded with up to 624 32 bit integers.
Doug Cutting wrote on nutch-dev:
> It just occurred to me that perhaps we could simply use sequential block
> numbering.
> All block ids are generated centrally on the namenode.
Response from Paul Baclace:
I'm not sure what the advantage of sequential block numbers would be
since long period PRNG block numbering does not even need to store
it's state, just pick a new starting place.
Sequential block numbering does have the downside that picking a datanode based
on (BlockNum % DataNodeCount) would devolve into round robin. Any attempt to
pass the sequence through a hash ends up becoming a random number generator.
Sequential numbering provides contiguous numbers, but after G.C. that would be
lost, so no advantage there.
When human beings eyeball block numbers, many with small differences are more
likely to be misread than many that are totally different.
If block numbering is sequential, then there is a temptation to use 32 bits
instead of 64, but 32 bits leads to wrap-around and uh oh.
FSNamesystem uses Random to help pick a target datanode, but it could just use
the randomness of block numbers.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira