[
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192820#comment-16192820
]
Daniel Cranford commented on CASSANDRA-12744:
---------------------------------------------
Some more thoughts:
The generation of partition keys has been "broken" since CASSANDRA-7519
The Linear Congruential Generators (LCGs) used in java.util.Random and by
extension JDKRandomGenerator generate good random number sequences, but similar
seeds result in similar sequences. Using the lcg update function {{lcg\(x) =
a*x + c}} like
random ~1~ = lcg(1)
random ~2~ = lcg(2)
random ~3~ = lcg(3)
...
random ~n~ = lcg\(n)
does not generate a good random sequence, this is a misuse of the LCG. LCGs are
supposed to be used like
random ~1~ = lcg(1)
random ~2~ = lcg(lcg(1))
random ~3~ = lcg(lcg(lcg(1)))
...
random ~n~ = lcg ^n^ (1)
I say "broken" in quotes because the misuse of LCGs ends up not mattering.
{{new java.util.Random(seed).nextDouble()}} will always differ from {{new
java.util.Random(seed + 1).nextDouble()}} by more than 1/100,000,000,000 Thus
with the default partition key population (=UNIFORM(1..100B)), seeds that
differ by 1 will generate distinct partition keys.
The only thing that matters about partition keys is how many distinct values
there are (and how large their lexical value is). The number of partition key
components doesn't matter. The cardinality of each partition key component
doesn't matter. The distribution of values in the lexical partition key space
doesn't matter.
At the end of the day, all the partition key components get concatenated and
the resulting bit vector is hashed resulting in a uniformly distributed 64 bit
token that determines where the data will be stored.
The easiest "fix" is to not use the partition key population to define the
number of partition keys. Take advantage of the fact that the only thing that
matters from a performance standpoint is the number of distinct partitions.
Leave the partition key distribution at uniform(1..100B), and use the n=
parameter to define the number of partitions.
An ideal fix would update the way partition keys are generated to use the LCG
generator properly. However, this seems difficult since LCGs don't support
random access (i.e., the only way to calculate the nth item in an LCG sequence
is to first calculate the n-1 preceding items), and all three seed generation
modes rely on the ability to randomly jump around in the seed sequence. This
could be worked around by using a PRNG that supports random access to the n'th
item in the sequence (e.g. something like JDK 1.8's SpittableRandom could be
easily extended to support this).
A more workable fix is to spread the generated seeds (typically drawn from a
smallish range of integers) around in the 2 ^64^ values a long can take before
seeding the LCG. An additional caveat to whatever function is used for
spreading the seeds needs to be invertable since LookbackableWriteGenerator's
implementation relies on the properties of the sequence it generates to perform
internal bookeeping.
Multiplication by an odd integer happens to be an invertable function (although
integer division is NOT the inverse operation, multiplication by the modular
inverse is). So the initial implementation (although broken) is not actually
that bad an idea. I propose fixing things by picking a static integer as the
multiplier and using multiplication by it's modular inverse to invert it for
LookbackableWriteGenerator
> Randomness of stress distributions is not good
> ----------------------------------------------
>
> Key: CASSANDRA-12744
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: T Jake Luciani
> Assignee: Ben Slater
> Priority: Minor
> Labels: stress
> Fix For: 4.0
>
> Attachments: CASSANDRA_12744_SeedManager_changes-trunk.patch
>
>
> The randomness of our distributions is pretty bad. We are using the
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100
> iterations it's only outputting 3. If you bump it to 10k it hits all 3
> values.
> I made a change to just use the default commons math random generator and now
> see all 3 values for n=10
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]