Deletion of these spam comments is addressed in
  https://issues.apache.org/jira/browse/INFRA-22787

On 1/24/22 15:23, pankaj kumar singh (Jira) wrote:
> 
>     [ 
> https://issues.apache.org/jira/browse/NUTCH-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481130#comment-17481130
>  ] 
> 
> pankaj kumar singh commented on NUTCH-122:
> ------------------------------------------
> 
> The product is berry flavor, and it tastes very good
> 
> [https://provensupplement.org/|https://provensupplement.org/]
> 
>> block numbers need a better random number generator
>> ---------------------------------------------------
>>
>>                 Key: NUTCH-122
>>                 URL: https://issues.apache.org/jira/browse/NUTCH-122
>>             Project: Nutch
>>          Issue Type: Bug
>>          Components: fetcher, indexer
>>    Affects Versions: 0.8
>>            Reporter: Paul Baclace
>>            Priority: Major
>>         Attachments: MersenneTwister.java, MersenneTwister.java
>>
>>
>> In order to support billions of block numbers, a better PRNG than 
>> java.util.Random is needed.  To reach billions with low probability of 
>> collision, 64 bit random numbers are needed (the Birthday Problem is the 
>> model for the number of bits needed; the result is that twice as many bits 
>> are needed as the number of bits to count the expected number of items.) The 
>> built-in java.util.Random keeps only 48 bits of state which is only 
>> sufficient for 2^24 items.  Using repeated calls to or more than one 
>> instance of Random does not increase its total entropy.  
>>   Analysis
>>     util.Random is a linear congruential generator (LCG) identical to 
>> drand48.
>>       util.Random keeps 48 bits of state and gangs together 2 consecutive 
>> values to return 64 bit values.
>>       LCGs suffer from periodicity in the low order bits which would make 
>> modular binning less than random.
>>         "low order bits" could mean least significant byte.
>>       LCGs have periods in the range 106 to 109 when using 32 bit words, a 
>> range of poor to fair.
>>       seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
>>         the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of 
>> time.
>>       Results of the Birthday Spacings Test look good.
>>      References
>>       http://www.math.utah.edu/~beebe/java/random/README
>>       http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
>> Recommended alternative:    MersenneTwister
>>       Matsumoto and Nishimura (1998).
>>       Longest period of any known generator 2^19937 or about 10^6001.
>>       A period that exceeds the number of unique values seems ideal; 
>> obviously a shorter period than the number of unique values (like 
>> util.Random)  is a problem).
>>       Faster than java.util.Random (Random was recent tweaked, however).
>>       Excellent result for Diehard Birthday Spacings Test.
>>       Can be seeded with up to 624 32 bit integers.
>> Doug Cutting wrote on nutch-dev:
>>> It just occurred to me that perhaps we could simply use sequential block 
>>> numbering. 
>>>  All block ids are generated centrally on the namenode.  
>> Response from Paul Baclace:
>> I'm not sure what the advantage of sequential block numbers would be
>> since long period PRNG block numbering does not even need to store
>> it's state, just pick a new starting place.
>> Sequential block numbering does have the downside that picking a datanode 
>> based on (BlockNum % DataNodeCount) would devolve into round robin.  Any 
>> attempt to pass the sequence through a hash ends up becoming a random number 
>> generator.
>> Sequential numbering provides contiguous numbers, but after G.C. that would 
>> be lost, so no advantage there.
>> When human beings eyeball block numbers, many with small differences are 
>> more likely to be misread than many that are totally different.
>> If block numbering is sequential, then there is a temptation to use 32 bits 
>> instead of 64, but 32 bits leads to wrap-around and uh oh. 
>> FSNamesystem uses Random to help pick a target datanode, but it could just 
>> use the randomness of block numbers.
> 
> 
> 
> --
> This message was sent by Atlassian Jira
> (v8.20.1#820001)
> 

Reply via email to