[ 
https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15565128#comment-15565128
 ] 

Benedict commented on CASSANDRA-12490:
--------------------------------------

bq. which does make me wonder how stress is respecting the distribution for the 
PK value 

The random number generator is separate from the distribution.  Resetting the 
seed just provides a deterministically pseudo-random value to the distribution; 
it does not affect the distribution of the output value.
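
To illustrate that separation, here is a minimal Python sketch. It is not how stress is implemented; the helper name `sample`, the seed value, and the normal parameters are all made up for the example. The seed only fixes the uniform draw, while the distribution alone decides how that draw maps to an output value:

```python
import random
from statistics import NormalDist

def sample(seed, dist):
    """Hypothetical helper: draw one value from `dist` using a generator reset to `seed`."""
    u = random.Random(seed).random()    # deterministic uniform value in [0, 1)
    u = min(max(u, 1e-12), 1 - 1e-12)   # inv_cdf is undefined at exactly 0 or 1
    return dist.inv_cdf(u)              # the distribution shapes the output

narrow = NormalDist(mu=50, sigma=1)     # assumed parameters, for illustration only
wide = NormalDist(mu=50, sigma=20)

same_seed_a = sample(42, narrow)
same_seed_b = sample(42, narrow)        # same seed, same distribution: same value
other_dist = sample(42, wide)           # same seed, different distribution
```

With the same seed, both distributions receive the same quantile; only the mapping from that quantile to a value differs.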

bq. For say normal distribution you'd need several * n to cover all the 
possible values and have close to a normal distribution

Well, it's not clear what the "correct" behaviour is here - which spec should 
win? The two specs (population and value count) are at odds, and users found it 
confusing (esp. for populating a dataset) to have far fewer values produced 
than they expected (for any population distribution).  So, when it is *easy* to 
do so we let value count win the competition, i.e. when the total set of values 
we should produce is known.  Though I'll admit I don't like the inconsistency. 
Perhaps we should always honour it, switching to the inverse distribution and 
deselecting values where it would be costly.
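
A small Python sketch of why the two specs are at odds, assuming a uniform population purely for illustration (the function name and the numbers are hypothetical, not taken from stress): drawing as many samples as there are possible values leaves a large share of the values never produced, because clashing samples collapse into one value.

```python
import random

def unique_values(sample_count, population, seed=0):
    """Draw `sample_count` samples uniformly from range(population); return the distinct values."""
    rng = random.Random(seed)
    return {rng.randrange(population) for _ in range(sample_count)}

values = unique_values(sample_count=1000, population=1000)
# For uniform draws the expected share of distinct values is about
# 1 - 1/e (roughly 63%), so len(values) falls well short of 1000.
distinct = len(values)
```

A user who read the count as a value count would expect 1000 values here and get noticeably fewer.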

bq. (a) there was already a sequence distribution type in the "legacy" 
distribution sets (presumably for just this purpose) 

I'm really not sure what you're referring to here.  It's been a while since I 
wrote this or looked at the really old legacy stuff, but I don't recall any 
legacy sequential distribution. There is a sequential *population* mode, i.e. 
to visit partitions in order, but I don't recall a sequential distribution (and 
I just had a glance at the source code and couldn't find it).

bq. (b) to me, one way of describing this is a uniform distribution with 
minimal chance of collisions (ie it's just another way for selecting values 
from a range).

Well, here's the rub: that is not semantically consistent with other 
distributions; you even had to qualify with "to me" - unfortunately (as I have 
discovered) not everyone shares your intuitions.  Users will be surprised, and 
annoyance and JIRA traffic will ensue. stress is already too difficult to use, 
and I think we should be aiming for maximum conformity of behaviour.  We have, 
after all, just discussed three possible variants of the "same" distribution - 
and users won't know which of these they have.  That said, since your modified 
approach doesn't absolutely break behaviour I just think it's a bad idea, not a 
terrible one.

bq. Finally, it's not quite correct to say I'm trying to populate all possible 
values for a column, rather trying to generate as many unique values as 
possible (within the specified ranges) for a given sample size (to minimise 
overwriting)

There is no overwriting; samples are simply discarded if they clash (when the 
population distribution wins).  It seems like you fall into the large bucket of 
users I mentioned before, who really want the count to be a value count not a 
sample count, so why not just honour this everywhere with the approach I 
suggested above?  i.e. if it looks likely to be costly to select {{count}} 
values from the distribution (because of collisions), instead invert the 
probability function and deselect values.
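
One possible reading of that suggestion, sketched in Python under stated assumptions: the function name `pick_distinct`, the "more than half the range" cost threshold, and the use of reciprocal weights as the "inverted" probability function are all illustrative choices, not anything stress does today.

```python
import random

def pick_distinct(weights, count, seed=0):
    """Pick `count` distinct indices; weights[i] > 0 is the relative probability of index i."""
    n = len(weights)
    if not 0 <= count <= n:
        raise ValueError("count must be between 0 and len(weights)")
    rng = random.Random(seed)
    if count <= n // 2:
        # Few collisions expected: sample directly and discard clashes.
        chosen = set()
        while len(chosen) < count:
            chosen.add(rng.choices(range(n), weights=weights)[0])
        return chosen
    # Collisions would be costly: sample the values to *drop* instead,
    # using inverted weights (assumed here to be 1 / w) so that likely
    # values are the least likely to be deselected.
    inverse = [1.0 / w for w in weights]
    dropped = set()
    while len(dropped) < n - count:
        dropped.add(rng.choices(range(n), weights=inverse)[0])
    return set(range(n)) - dropped

weights = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
most = pick_distinct(weights, 9)   # deselect path: drops a single value
few = pick_distinct(weights, 3)    # select path: rejection sampling
```

Either path returns exactly `count` values, so the count behaves as a value count rather than a sample count.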



> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds. 
> This ensures generated values don't overlap (unless the sequence wraps) 
> providing a more predictable number of inserted records (and generating a base 
> set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available. 
> I think it would be useful to have this for doing initial load of data for 
> testing 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
