[ https://issues.apache.org/jira/browse/CASSANDRA-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027722#comment-16027722 ]
Ben Slater commented on CASSANDRA-12744:
----------------------------------------

So I took a look into this, with the following findings:

1) The dtest is broken because it assumes that when you run c*-stress with n=10000 you will end up with 10,000 rows inserted, when I think the actual functional guarantee is that it will run 10,000 insert operations.

2) However, with the JDKRandomGenerator this assumption holds up to a few hundred thousand records. Even with n=1M you end up with 999,999 records in the table. For some reason, changing to the library default Well19937c generator means that not only is the assumption broken at n=10k, but it seems to get proportionally worse as n increases.

So, on those findings, I don't think changing the generator is a good idea.

I then tried to dig a bit deeper into what was causing the issue. As part of this, I wrote some code to generate values directly from the distributions in various ways, and the results all seemed as expected (i.e. reasonably aligned with the distribution type).

After a bit more digging, and to cut a long story short, I found that the actual issue is related to the -pop setting. I'm still a bit hazy on this, but it seems -pop is the distribution of all possible keys. So, if I have a -pop of dist(1..10) I can only have 10 possible key values (i.e. combinations across all columns), no matter what ranges are specified for the key columns in the YAML file. The default for -pop is UNIFORM(1..n) where n is specified, or 1..1,000,000 where no n is specified. I think this all leads to somewhat counter-intuitive results, particularly with multi-part keys.

So, I think the actual answer here is to change the rules for the default -pop for YAML runs so that the population size equals the product of the population sizes of each key column as specified in the YAML. For example, if I have two columns:

    partition_key UNIFORM(1..1M)
    cluster_key UNIFORM(1..100)

Then the default population should be 1..100M.
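As a rough illustration of the rule proposed above, here is a minimal Python sketch. It is not cassandra-stress code: the function names, the dictionary layout, and the simplified model in which each operation draws a single seed from the population are all assumptions made for illustration only.

```python
import random
from math import prod


def default_population_size(key_ranges):
    """Product of the per-column population sizes.

    key_ranges is a hypothetical mapping of column name -> (low, high),
    with both bounds inclusive, mirroring ranges such as UNIFORM(1..100).
    """
    return prod(high - low + 1 for low, high in key_ranges.values())


def distinct_seeds(pop_size, operations, rng):
    """Count distinct seeds when each operation draws one seed uniformly
    from 1..pop_size (an assumed, simplified model of -pop)."""
    return len({rng.randint(1, pop_size) for _ in range(operations)})


if __name__ == "__main__":
    # The two-column example from the comment.
    key_ranges = {
        "partition_key": (1, 1_000_000),
        "cluster_key": (1, 100),
    }
    print(default_population_size(key_ranges))  # 100000000

    # With a population of 10, the number of distinct seeds can never
    # exceed 10, however many operations are run.
    print(distinct_seeds(10, 10_000, random.Random(42)))
```

Under this model, the number of distinct keys is bounded by the population size rather than by the per-column ranges, which is the counter-intuitive effect described above.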
I think this is already implied by the YAML and is what people would expect (certainly what I expected). I don't think this change will be too hard to make, but I'm interested to hear if anyone has an opinion before I jump into it.

> Randomness of stress distributions is not good
> ----------------------------------------------
>
> Key: CASSANDRA-12744
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12744
> Project: Cassandra
> Issue Type: Bug
> Components: Tools
> Reporter: T Jake Luciani
> Assignee: Ben Slater
> Priority: Minor
> Labels: stress
> Fix For: 4.0
>
>
> The randomness of our distributions is pretty bad. We are using the
> JDKRandomGenerator() but in testing of uniform(1..3) we see for 100
> iterations it's only outputting 3. If you bump it to 10k it hits all 3
> values.
> I made a change to just use the default commons math random generator and now
> see all 3 values for n=10

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)