[ https://issues.apache.org/jira/browse/NUTCH-122 ]
Chris Lambertus deleted comment on NUTCH-122:
---------------------------------------
was (Author: JIRAUSER284099):
emphasizing this factor because the digestive benefits are the only claims
[https://sonuscompletestore.com/|https://sonuscompletestore.com/]
> block numbers need a better random number generator
> ---------------------------------------------------
>
> Key: NUTCH-122
> URL: https://issues.apache.org/jira/browse/NUTCH-122
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, indexer
> Affects Versions: 0.8
> Reporter: Paul Baclace
> Priority: Major
> Attachments: MersenneTwister.java, MersenneTwister.java
>
>
> In order to support billions of block numbers, a better PRNG than
> java.util.Random is needed. To reach billions with low probability of
> collision, 64 bit random numbers are needed (the Birthday Problem is the
> model for the number of bits needed; the result is that twice as many bits
> are needed as the number of bits to count the expected number of items.) The
> built-in java.util.Random keeps only 48 bits of state which is only
> sufficient for 2^24 items. Using repeated calls to or more than one instance
> of Random does not increase its total entropy.
> Analysis
> util.Random is a linear congruential generator (LCG) identical to drand48.
> util.Random keeps 48 bits of state and gangs together 2 consecutive
> values to return 64 bit values.
> LCGs suffer from periodicity in the low order bits which would make
> modular binning less than random.
> "low order bits" could mean least significant byte.
> LCGs have periods in the range 106 to 109 when using 32 bit words, a
> range of poor to fair.
> seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
> the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of
> time.
> Results of the Birthday Spacings Test look good.
> References
> http://www.math.utah.edu/~beebe/java/random/README
> http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
> Recommended alternative: MersenneTwister
> Matsumoto and Nishimura (1998).
> Longest period of any known generator 2^19937 or about 10^6001.
> A period that exceeds the number of unique values seems ideal;
> obviously a shorter period than the number of unique values (like
> util.Random) is a problem).
> Faster than java.util.Random (Random was recent tweaked, however).
> Excellent result for Diehard Birthday Spacings Test.
> Can be seeded with up to 624 32 bit integers.
> Doug Cutting wrote on nutch-dev:
> > It just occurred to me that perhaps we could simply use sequential block
> > numbering.
> > All block ids are generated centrally on the namenode.
> Response from Paul Baclace:
> I'm not sure what the advantage of sequential block numbers would be
> since long period PRNG block numbering does not even need to store
> it's state, just pick a new starting place.
> Sequential block numbering does have the downside that picking a datanode
> based on (BlockNum % DataNodeCount) would devolve into round robin. Any
> attempt to pass the sequence through a hash ends up becoming a random number
> generator.
> Sequential numbering provides contiguous numbers, but after G.C. that would
> be lost, so no advantage there.
> When human beings eyeball block numbers, many with small differences are more
> likely to be misread than many that are totally different.
> If block numbering is sequential, then there is a temptation to use 32 bits
> instead of 64, but 32 bits leads to wrap-around and uh oh.
> FSNamesystem uses Random to help pick a target datanode, but it could just
> use the randomness of block numbers.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)