[jira] [Comment Edited] (CASSANDRA-12744) Randomness of stress distributions is not good

Ben Slater (JIRA) Sat, 27 May 2017 23:50:25 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027722#comment-16027722
 ]


Ben Slater edited comment on CASSANDRA-12744 at 5/28/17 6:49 AM:
-----------------------------------------------------------------

So I took a look into this with the following findings:
1) The dtest is broken because it assumes that when you run c*-stress with 
n=10000 you will end up with 10,000 rows inserted, when I think the actual 
functional guarantee is only that it will run 10,000 insert operations.
2) However, with the JDKRandomGenerator this assumption holds up to a few 
hundred thousand records. Even with n=1M you end up with 999,999 records in 
the table. For some reason, changing to the library default Well19937c 
generator means not only is the assumption broken at n=10k, but it seems to 
get proportionally worse as n increases.

So, on those findings, I don't think changing the generator is a good idea.

So, I tried to dig a bit deeper into what was causing the issue. As part of 
this, I wrote some code to generate values directly from the distributions in 
various ways and the results all seemed as expected (ie reasonably aligned with 
the distribution type). 

After a bit more digging, and to cut a long story short, I found that the 
actual issue is related to the -pop setting. I'm still a bit hazy on this but 
it seems -pop is the distribution of all possible keys. So, if I have a -pop of 
dist(1..10) I can only have 10 possible key values (ie combinations across all 
columns) no matter what the ranges specified for the key columns in the YAML 
file are. The default for -pop is UNIFORM(1..n) where n is specified or 
1..1,000,000 where no n is specified. I think this all leads to somewhat 
counter-intuitive results, particularly with multi-part keys.
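
A minimal Python sketch of the effect described above. It is not 
cassandra-stress code: it assumes, purely for illustration, that each key is 
derived deterministically from a seed drawn from the -pop range. The names 
key_from_seed and distinct_keys and the seeding scheme are hypothetical. The 
only point is that the number of distinct keys can never exceed the population 
size, whatever the per-column ranges are.

```python
import random

def key_from_seed(seed, column_ranges):
    """Hypothetical: derive one value per key column from a population seed."""
    rng = random.Random(seed)  # same seed always yields the same key
    return tuple(rng.randint(lo, hi) for lo, hi in column_ranges)

def distinct_keys(pop_size, operations, column_ranges, rng_seed=0):
    """Count distinct keys after `operations` inserts with seeds from 1..pop_size."""
    pop = random.Random(rng_seed)
    seen = set()
    for _ in range(operations):
        seed = pop.randint(1, pop_size)  # stand-in for UNIFORM(1..pop_size)
        seen.add(key_from_seed(seed, column_ranges))
    return len(seen)

# Column ranges as in a YAML profile: 1..1M partition keys, 1..100 cluster keys.
column_ranges = [(1, 1_000_000), (1, 100)]

# 10,000 operations, but a population of only 10 seeds: at most 10 distinct keys.
print(distinct_keys(pop_size=10, operations=10_000, column_ranges=column_ranges))
```

Under this assumption the column ranges only decide which values the 10 keys 
take, not how many keys there can be.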

So, I think the actual answer here is to change the rules for the default -pop  
for yaml runs to have a population size equal to the product of the population 
size of each key as specified in the YAML.  For example, if I have two columns: 
partition_key UNIFORM(1..1M)
cluster_key UNIFORM(1..100)

then the default population should be 1..100M. I think this is already implied 
by the YAML and what people would expect (certainly what I expected).
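
The arithmetic behind that proposal, as a short Python sketch. The function 
name default_population and the (low, high) tuple representation are 
assumptions made for illustration, not part of the stress tool.

```python
from math import prod

def default_population(column_ranges):
    """Proposed default: product of the per-column population sizes."""
    return prod(hi - lo + 1 for lo, hi in column_ranges)

ranges = {
    "partition_key": (1, 1_000_000),  # UNIFORM(1..1M)
    "cluster_key": (1, 100),          # UNIFORM(1..100)
}

print(default_population(ranges.values()))  # 100000000, i.e. 1..100M
```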

I don't think this change will be too hard to make, but I'm interested to hear 
if anyone has any opinions before I jump into it.


> Randomness of stress distributions is not good
> ----------------------------------------------
>
>                 Key: CASSANDRA-12744
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>            Reporter: T Jake Luciani
>            Assignee: Ben Slater
>            Priority: Minor
>              Labels: stress
>             Fix For: 4.0
>
>
> The randomness of our distributions is pretty bad.  We are using the 
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100 
> iterations it's only outputting 3.  If you bump it to 10k it hits all 3 
> values. 
> I made a change to just use the default commons math random generator and now 
> see all 3 values for n=10


