Ben Slater commented on CASSANDRA-12490:
OK, so I did some more investigation this evening to try to better understand
this and found a few interesting things. I suspect there is at least one bug
here but I'll be interested to see what you think.
I set up a simple spec to test what was going on:
CREATE TABLE test4 (
PRIMARY KEY (pk)
- name: pk
When I run this with `ops(insert=1) n=50` the end result is 1 row added to the
table. When I run it with n=500 I get 3 rows. Some other observations:
a) tracing this through it seems that's because the small number of seed values
from the population (due to the small n=) results in a very low variation in
values being returned from `delegate.sample()` in
`DistributionBoundApache.next()`. They were (all 37<x<38 so get rounded to 37 by
b) increasing n to 500 increases the number of rows to 3
c) session.execute() gets called n times despite the overlap (so it looks to me
like it is overwriting)
d) using exp() instead of uniform also produces the same number of rows (but
e) using seq() (new implementation) produces 50 rows with n=50
f) if I change the implementation of setSeed() in DistributionBoundApache to a
null operation (as a quick test, not the right fix) I get 31 rows with n=50 and
50 rows with n=500 which is the behaviour I would have expected
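To illustrate observations (a) and (f), here is a minimal sketch of how reseeding before every sample can collapse the range of generated values. It rests on assumptions: `java.util.Random` stands in for whatever generator sits behind `DistributionBoundApache`, and the class and method names (`ReseedSketch`, `distinctWithReseed`, `distinctWithoutReseed`, `sample`) are hypothetical and not part of cassandra-stress. It shows the general effect only and does not reproduce the exact row counts above.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only: java.util.Random is a stand-in generator,
// and none of these names exist in cassandra-stress.
final class ReseedSketch
{
    // Map a uniform draw in [0, 1) onto [min, max) and truncate to a long,
    // similar in spirit to rounding a sampled double down to a key value.
    static long sample(Random rng, long min, long max)
    {
        return (long) (min + rng.nextDouble() * (max - min));
    }

    // Reseed with sequential seeds 0..n-1 and take only the first draw each time.
    static int distinctWithReseed(int n, long min, long max)
    {
        Set<Long> keys = new HashSet<>();
        Random rng = new Random();
        for (long seed = 0; seed < n; seed++)
        {
            rng.setSeed(seed);
            keys.add(sample(rng, min, max));
        }
        return keys.size();
    }

    // Seed once and keep drawing from the same generator.
    static int distinctWithoutReseed(int n, long min, long max)
    {
        Set<Long> keys = new HashSet<>();
        Random rng = new Random(0);
        for (int i = 0; i < n; i++)
        {
            keys.add(sample(rng, min, max));
        }
        return keys.size();
    }

    public static void main(String[] args)
    {
        System.out.println("distinct keys, reseeding each draw: " + distinctWithReseed(50, 1, 100));
        System.out.println("distinct keys, single seed:         " + distinctWithoutReseed(50, 1, 100));
    }
}
```

With this stand-in generator, the first draw after seeding with small adjacent seeds lands in a very narrow band, so most of the n writes would target the same key, whereas drawing repeatedly from one seeded generator spreads across the range. Whether the generator used by cassandra-stress behaves the same way is exactly what would need confirming.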
I know that the small numbers aren't necessarily representative when we're
talking about statistical distributions but it seems the behaviour is far
enough from what is expected to be indicative of an issue (and I suspect this
is actually the root of what caused me to create seq() in the first place).
Feels like this is morphing into a different jira but I guess it makes sense to
work out what that is here before opening something new.
Be very interested to hear what you think.
> Add sequence distribution type to cassandra stress
> Key: CASSANDRA-12490
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Ben Slater
> Assignee: Ben Slater
> Priority: Minor
> Fix For: 3.10
> Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
> When using the write command, cassandra stress sequentially generates seeds.
> This ensures generated values don't overlap (unless the sequence wraps)
> providing a more predictable number of inserted records (and generating a base
> set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available.
> I think it would be useful to have this for doing initial load of data for