[
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027722#comment-16027722
]
Ben Slater edited comment on CASSANDRA-12744 at 5/28/17 6:49 AM:
-----------------------------------------------------------------
So I took a look into this with the following findings:
1) The dtest is broken because it assumes that when you run c*-stress with
n=10000 you will end up with 10,000 rows inserted, when I think the actual
functional guarantee is that it will run 10,000 insert operations.
2) However, with the JDKRandomGenerator this assumption holds up to a few
hundred thousand records. Even with n=1M you end up with 999,999 records in
the table. For some reason, changing to the library default Well19937c
generator means not only is the assumption broken at n=10k but it seems to get
proportionally worse as n increases.
So, on those findings, I don't think changing the generator is a good idea.
So, I tried to dig a bit deeper into what was causing the issue. As part of
this, I wrote some code to generate values directly from the distributions in
various ways and the results all seemed as expected (i.e. reasonably aligned
with the distribution type).
After a bit more digging, and to cut a long story short, I found that the
actual issue is related to the -pop setting. I'm still a bit hazy on this but
it seems -pop is the distribution of all possible keys. So, if I have a -pop of
dist(1..10) I can only have 10 possible key values (i.e. combinations across
all columns) no matter what ranges are specified for the key columns in the
YAML file. The default for -pop is UNIFORM(1..n) where n is specified, or
1..1,000,000 where no n is specified. I think this all leads to somewhat
counter-intuitive results, particularly with multi-part keys.
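
To illustrate the -pop behaviour described above, here is a toy model in Python. It is only a sketch under assumptions, not the cassandra-stress implementation: it assumes each operation draws one seed from the population and that the whole key is derived deterministically from that seed. The names key_for_seed and distinct_keys are hypothetical.

```python
import random

def key_for_seed(seed, partition_range=(1, 1_000_000), cluster_range=(1, 100)):
    """Derive a (partition, cluster) key deterministically from a single seed.

    Assumption: the full key is a pure function of the population seed.
    """
    rng = random.Random(seed)
    return (rng.randint(*partition_range), rng.randint(*cluster_range))

def distinct_keys(pop_range, operations, rng_seed=42):
    """Run `operations` inserts, drawing each seed uniformly from pop_range."""
    rng = random.Random(rng_seed)
    lo, hi = pop_range
    return {key_for_seed(rng.randint(lo, hi)) for _ in range(operations)}

# A population of 1..10 can never yield more than 10 distinct keys,
# even though the column ranges allow 100M combinations.
keys = distinct_keys(pop_range=(1, 10), operations=10_000)
print(len(keys))
```

In this model the column ranges only affect which values the keys take, while the population size caps how many distinct keys can ever appear.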
So, I think the actual answer here is to change the rules for the default -pop
for yaml runs to have a population size equal to the product of the population
size of each key as specified in the YAML. For example, if I have two columns:
partition_key UNIFORM(1..1M)
cluster_key UNIFORM(1..100)
then the default population should be 1..100M. I think this is already implied
by the YAML and what people would expect (certainly what I expected).
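
The proposed default can be sketched as a product of the per-column population sizes. This is a minimal illustration of the arithmetic only; default_population_size is a hypothetical helper, and representing each column as an inclusive (low, high) range is an assumption.

```python
from math import prod

def default_population_size(key_ranges):
    """Return the product of the sizes of inclusive (low, high) key ranges."""
    return prod(hi - lo + 1 for lo, hi in key_ranges)

# partition_key UNIFORM(1..1M), cluster_key UNIFORM(1..100)
key_ranges = [(1, 1_000_000), (1, 100)]
print(default_population_size(key_ranges))  # 100000000, i.e. 1..100M
```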
I don't think this change will be too hard to make, but I'm interested to hear
if anyone has any opinions before I jump into it.
> Randomness of stress distributions is not good
> ----------------------------------------------
>
> Key: CASSANDRA-12744
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: T Jake Luciani
> Assignee: Ben Slater
> Priority: Minor
> Labels: stress
> Fix For: 4.0
>
>
> The randomness of our distributions is pretty bad. We are using the
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100
> iterations it's only outputting 3. If you bump it to 10k it hits all 3
> values.
> I made a change to just use the default commons math random generator and now
> see all 3 values for n=10