[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good

Daniel Cranford (JIRA) Thu, 05 Oct 2017 05:50:50 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192820#comment-16192820
 ]


Daniel Cranford commented on CASSANDRA-12744:
---------------------------------------------

Some more thoughts:

The generation of partition keys has been "broken" since CASSANDRA-7519

The Linear Congruential Generators (LCGs) used in java.util.Random and by 
extension JDKRandomGenerator generate good random number sequences, but similar 
seeds result in similar sequences. Using the lcg update function {{lcg\(x) = 
a*x + c}} like

random ~1~ = lcg(1)
random ~2~ = lcg(2)
random ~3~ = lcg(3)
...
random ~n~ = lcg\(n)

does not generate a good random sequence, this is a misuse of the LCG. LCGs are 
supposed to be used like

random ~1~ = lcg(1)
random ~2~ = lcg(lcg(1))
random ~3~ = lcg(lcg(lcg(1)))
...
random ~n~ = lcg ^n^ (1)

I say "broken" in quotes because the misuse of LCGs ends up not mattering. 
{{new java.util.Random(seed).nextDouble()}} will always differ from {{new 
java.util.Random(seed + 1).nextDouble()}} by more than 1/100,000,000,000 Thus 
with the default partition key population (=UNIFORM(1..100B)), seeds that 
differ by 1 will generate distinct partition keys.

The only thing that matters about partition keys is how many distinct values 
there are (and how large their lexical value is). The number of partition key 
components doesn't matter. The cardinality of each partition key component 
doesn't matter. The distribution of values in the lexical partition key space 
doesn't matter.

At the end of the day, all the partition key components get concatenated and 
the resulting bit vector is hashed resulting in a uniformly distributed 64 bit 
token that determines where the data will be stored.

The easiest "fix" is to not use the partition key population to define the 
number of partition keys. Take advantage of the fact that the only thing that 
matters from a performance standpoint is the number of distinct partitions. 
Leave the partition key distribution at uniform(1..100B), and use the n= 
parameter to define the number of partitions.

An ideal fix would update the way partition keys are generated to use the LCG 
generator properly. However, this seems difficult since LCGs don't support 
random access (i.e., the only way to calculate the nth item in an LCG sequence 
is to first calculate the n-1 preceding items), and all three seed generation 
modes rely on the ability to randomly jump around in the seed sequence. This 
could be worked around by using a PRNG that supports random access to the n'th 
item in the sequence (e.g. something like JDK 1.8's SpittableRandom could be 
easily extended to support this).

A more workable fix is to spread the generated seeds (typically drawn from a 
smallish range of integers) around in the 2 ^64^ values a long can take before 
seeding the LCG. An additional caveat to whatever function is used for 
spreading the seeds needs to be invertable since LookbackableWriteGenerator's 
implementation relies on the properties of the sequence it generates to perform 
internal bookeeping.

Multiplication by an odd integer happens to be an invertable function (although 
integer division is NOT the inverse operation, multiplication by the modular 
inverse is). So the initial implementation (although broken) is not actually 
that bad an idea. I propose fixing things by picking a static integer as the 
multiplier and using multiplication by it's modular inverse to invert it for 
LookbackableWriteGenerator


> Randomness of stress distributions is not good
> ----------------------------------------------
>
>                 Key: CASSANDRA-12744
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: T Jake Luciani
>            Assignee: Ben Slater
>            Priority: Minor
>              Labels: stress
>             Fix For: 4.0
>
>         Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch
>
>
> The randomness of our distributions is pretty bad.  We are using the 
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 
> iterations it's only outputting 3.  If you bump it to 10k it hits all 3 
> values. 
> I made a change to just use the default commons math random generator and now 
> see all 3 values for n=10



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-12744) Randomness of stress distributions is not good

Reply via email to