[ 
https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571589#comment-15571589
 ] 

Ben Slater edited comment on CASSANDRA-12490 at 10/13/16 10:57 AM:
-------------------------------------------------------------------

OK, so I did some more investigation this evening to try to better understand 
this and found a few interesting things. I suspect there is at least one bug 
here but I'll be interested to see what you think.

I set up a simple spec to test what was going on:
{code}
table: test4
table_definition: |
  CREATE TABLE test4 (
        pk text,
        val text,
        PRIMARY KEY (pk)
  ) 
columnspec:
  - name: pk
    size: fixed(32) 
    population: uniform(1..50)
{code}

When I run this with `ops(insert=1) n=50` the end result is 1 row added to the 
table. When I run it with n=500 I get 3 rows. Some other observations:
a) tracing this through, it seems that's because the small number of seed values 
from the population (due to the small n=) results in very low variation in the 
values returned from `delegate.sample()` in 
`DistributionBoundApache.next()`. They were all 37<x<38, so they get rounded 
down to 37 by `bound()`
b) increasing n to 500 increases the number of rows to 3
c) session.execute() gets called n times despite the overlap (so it looks to me 
like it is overwriting)
d) using exp() instead of uniform() also produces the same number of rows (but 
different values) 
e) using seq() (new implementation) produces 50 rows with n=50
f) if I change the implementation of setSeed() in DistributionBoundApache to a 
null operation (as a quick test, not the right fix) I get 31 rows with n=50 and 
50 rows with n=500 which is the behaviour I would have expected
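
To illustrate observations (a) and (f), here is a minimal sketch rather than the actual stress code. It assumes the delegate behaves like `java.util.Random` reseeded with sequential seeds before every draw, and that the bounding step clamps and truncates; both are assumptions, and `SeedDemo`, `sample()` and the two helper methods are hypothetical names made up for this example.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class SeedDemo {
    // Hypothetical stand-in for the spec's uniform(1..50) population.
    static final long MIN = 1;
    static final long MAX = 50;

    // Assumed sampling path: draw a real in [MIN, MAX + 1), clamp to the
    // bounds, then truncate to a long.
    static long sample(Random rng) {
        double raw = MIN + rng.nextDouble() * (MAX + 1 - MIN);
        return (long) Math.max(MIN, Math.min(MAX, raw));
    }

    // Reseed with sequential seeds 1..n before each draw, so only the
    // first value after each reseed is ever used.
    static Set<Long> distinctWithReseed(int n) {
        Random rng = new Random();
        Set<Long> seen = new TreeSet<>();
        for (long seed = 1; seed <= n; seed++) {
            rng.setSeed(seed);
            seen.add(sample(rng));
        }
        return seen;
    }

    // Seed once and keep drawing, roughly what turning setSeed() into a
    // no-op amounts to in this sketch.
    static Set<Long> distinctWithoutReseed(int n) {
        Random rng = new Random(1L);
        Set<Long> seen = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(sample(rng));
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("reseeded, n=50: " + distinctWithReseed(50));
        System.out.println("seeded once, n=50: "
                + distinctWithoutReseed(50).size() + " distinct values");
    }
}
```

With `java.util.Random`, the first `nextDouble()` after seeding with small consecutive seeds varies very little, so the reseeded loop collapses to the single value 37 for n=50, while the seeded-once loop spreads across the range. Whether the real delegate shares this property is not confirmed here.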

I know that the small numbers aren't necessarily representative when we're 
talking about statistical distributions but it seems the behaviour is far 
enough from what is expected to be indicative of an issue (and I suspect this 
is actually the root of what caused me to create seq() in the first place).

Feels like this is morphing into a different jira but I guess it makes sense to 
work out what that is here before opening something new.

Be very interested to hear what you think.



> Add sequence distribution type to cassandra stress
> --------------------------------------------------
>
>                 Key: CASSANDRA-12490
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Ben Slater
>            Assignee: Ben Slater
>            Priority: Minor
>             Fix For: 3.10
>
>         Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds. 
> This ensures generated values don't overlap (unless the sequence wraps), 
> providing a more predictable number of inserted records (and generating a base 
> set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available. 
> I think it would be useful to have this for doing the initial load of data for 
> testing.



