[ https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565128#comment-15565128 ]
Benedict commented on CASSANDRA-12490:
--------------------------------------

bq. which does make me wonder how stress is respecting the distribution for the PK value

The random number generator is separate to the distribution. Resetting the seed just provides a deterministically pseudo-random value to the distribution; it does not affect the distribution of the output value.

bq. For say normal distribution you'd need several * n to cover all the possible values and have close to a normal distribution

Well, it's not clear what the "correct" behaviour is here - which spec should win? The two specs (population and value count) are at odds, and users found it confusing (esp. for populating a dataset) to have far fewer values produced than they expected (for any population distribution). So, when it is *easy* to do so we let value count win the competition, i.e. when the total set of values we should produce is known. Though I'll admit I don't like the inconsistency. Perhaps we should always honour it, switching to the inverse distribution and deselecting values where it would be costly.

bq. (a) there was already a sequence distribution type in the "legacy" distribution sets (presumably for just this purpose)

I'm really not sure what you're referring to here. It's been a while since I wrote this or looked at the really old legacy stuff, but I don't recall any legacy sequential distribution. There is a sequential *population* mode, i.e. to visit partitions in order, but I don't recall a sequential distribution (and I just had a glance at the source code and couldn't find it).

bq. (b) to me, one way of describing this is a uniform distribution with minimal chance of collisions (ie it's just another way for selecting values from a range).

Well, here's the rub, that is not consistent semantically to other distributions; you even had to qualify with "to me" - unfortunately (as I have discovered) not everyone has your own intuitions.
Users will be surprised, and annoyance and JIRA traffic will ensue. stress is already too difficult to use, and I think we should be aiming for maximum conformity of behaviour. We have, after all, just discussed three possible variants of the "same" distribution - and users won't know which of these they have. That said, since your modified approach doesn't absolutely break behaviour I just think it's a bad idea, not a terrible one.

bq. Finally, it's not quite correct to say I'm trying to populate all possible values for a column, rather trying to generate as many unique values as possible (within the specified ranges) for a given sample size (to minimise overwriting)

There is no overwriting, just the samples are discarded if they clash (when the population distribution wins). It seems like you fall into the large bucket of users I mentioned before, who really want the count to be a value count not a sample count, so why not just honour this everywhere with the approach I suggested above? i.e. If it looks likely to be costly to select {{count}} values from the distribution (because of collisions), instead invert the probability function and deselect values.

> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds.
> This ensures generated values don't overlap (unless the sequence wraps)
> providing more predictable number of inserted records (and generating a base
> set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available.
> I think it would be useful to have this for doing initial load of data for
> testing

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
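
As an illustration of the sample-count versus value-count distinction discussed in the comment above, here is a minimal Python sketch. It is not cassandra-stress code: the function names (`draw_samples`, `draw_unique`), the bell-shaped weights, and the "more than half the population" threshold are all assumptions made for the example. It shows that reseeding only fixes which pseudo-random inputs the distribution receives, that drawing `count` samples yields fewer distinct values once collisions are discarded, and one possible reading of "invert the probability function and deselect values".

```python
import math
import random


def draw_samples(rng, population, weights, count):
    """Sample-count semantics: draw `count` samples; clashing samples collapse."""
    return set(rng.choices(population, weights=weights, k=count))


def draw_unique(rng, population, weights, count):
    """Value-count semantics: return exactly `count` distinct values."""
    if count > len(population):
        raise ValueError("count exceeds population size")
    if count <= len(population) // 2:
        # Few values wanted: collisions are rare enough to select by rejection.
        chosen = set()
        while len(chosen) < count:
            chosen.add(rng.choices(population, weights=weights, k=1)[0])
        return chosen
    # Many values wanted: choose the values to drop instead, weighting each
    # one by the inverse of its selection weight (assumed interpretation).
    inverse = [1.0 / w for w in weights]
    dropped = set()
    while len(dropped) < len(population) - count:
        dropped.add(rng.choices(population, weights=inverse, k=1)[0])
    return set(population) - dropped


# Hypothetical population of 100 values with bell-shaped (unnormalised) weights.
population = list(range(1, 101))
weights = [math.exp(-((v - 50.5) ** 2) / (2 * 15.0 ** 2)) for v in population]

# Same seed, same output: the seed only fixes the inputs to the distribution.
a = draw_samples(random.Random(42), population, weights, 100)
b = draw_samples(random.Random(42), population, weights, 100)

# Value-count semantics: exactly 90 distinct values, via deselection.
u = draw_unique(random.Random(1), population, weights, 90)

print(a == b, len(a), len(u))
```

With sample-count semantics the number of distinct values in `a` falls short of the 100 requested, which is the behaviour users reportedly found confusing; `draw_unique` always returns the requested number of values, at the cost of bending the requested distribution.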