Deletion of these spam comments is addressed in https://issues.apache.org/jira/browse/INFRA-22787
On 1/24/22 15:23, pankaj kumar singh (Jira) wrote: > > [ > https://issues.apache.org/jira/browse/NUTCH-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481130#comment-17481130 > ] > > pankaj kumar singh commented on NUTCH-122: > ------------------------------------------ > > The product is berry flavor, and it tastes very good > > [https://provensupplement.org/|https://provensupplement.org/] > >> block numbers need a better random number generator >> --------------------------------------------------- >> >> Key: NUTCH-122 >> URL: https://issues.apache.org/jira/browse/NUTCH-122 >> Project: Nutch >> Issue Type: Bug >> Components: fetcher, indexer >> Affects Versions: 0.8 >> Reporter: Paul Baclace >> Priority: Major >> Attachments: MersenneTwister.java, MersenneTwister.java >> >> >> In order to support billions of block numbers, a better PRNG than >> java.util.Random is needed. To reach billions with low probability of >> collision, 64 bit random numbers are needed (the Birthday Problem is the >> model for the number of bits needed; the result is that twice as many bits >> are needed as the number of bits to count the expected number of items.) The >> built-in java.util.Random keeps only 48 bits of state which is only >> sufficient for 2^24 items. Using repeated calls to or more than one >> instance of Random does not increase its total entropy. >> Analysis >> util.Random is a linear congruential generator (LCG) identical to >> drand48. >> util.Random keeps 48 bits of state and gangs together 2 consecutive >> values to return 64 bit values. >> LCGs suffer from periodicity in the low order bits which would make >> modular binning less than random. >> "low order bits" could mean least significant byte. >> LCGs have periods in the range 106 to 109 when using 32 bit words, a >> range of poor to fair. >> seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1); >> the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of >> time. >> Results of the Birthday Spacings Test look good. >> References >> http://www.math.utah.edu/~beebe/java/random/README >> http://www.pierssen.com/arcview/upload/esoterica/randomizer.html >> Recommended alternative: MersenneTwister >> Matsumoto and Nishimura (1998). >> Longest period of any known generator 2^19937 or about 10^6001. >> A period that exceeds the number of unique values seems ideal; >> obviously a shorter period than the number of unique values (like >> util.Random) is a problem). >> Faster than java.util.Random (Random was recent tweaked, however). >> Excellent result for Diehard Birthday Spacings Test. >> Can be seeded with up to 624 32 bit integers. >> Doug Cutting wrote on nutch-dev: >>> It just occurred to me that perhaps we could simply use sequential block >>> numbering. >>> All block ids are generated centrally on the namenode. >> Response from Paul Baclace: >> I'm not sure what the advantage of sequential block numbers would be >> since long period PRNG block numbering does not even need to store >> it's state, just pick a new starting place. >> Sequential block numbering does have the downside that picking a datanode >> based on (BlockNum % DataNodeCount) would devolve into round robin. Any >> attempt to pass the sequence through a hash ends up becoming a random number >> generator. >> Sequential numbering provides contiguous numbers, but after G.C. that would >> be lost, so no advantage there. >> When human beings eyeball block numbers, many with small differences are >> more likely to be misread than many that are totally different. >> If block numbering is sequential, then there is a temptation to use 32 bits >> instead of 64, but 32 bits leads to wrap-around and uh oh. >> FSNamesystem uses Random to help pick a target datanode, but it could just >> use the randomness of block numbers. > > > > -- > This message was sent by Atlassian Jira > (v8.20.1#820001) >

