[jira] [Comment Edited] (CASSANDRA-11138) cassandra-stress tool - clustering key values not distributed

Alwyn Davis (JIRA) Thu, 22 Sep 2016 16:37:33 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-11138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514810#comment-15514810 ]


Alwyn Davis edited comment on CASSANDRA-11138 at 9/22/16 11:36 PM:
-------------------------------------------------------------------

I think this occurs because, with 3 or more clustering keys, the {{lastRow}}
check in the {{PartitionIterator}} class always causes the iterator to exit
once the clustering components are exhausted.

When {{decompose}} creates {{lastRow}}, position is always a multiple of
{{generator.clusteringDescendantAverages\[0\]}} and is then divided by that
same value.  So there is never a remainder, and consequently any {{lastRow}} to
{{currentRow}} comparison will indicate that the {{currentRow}} column has more
distinct values than we want. {{lastRow}} will be something like:
{code}{<clusteringDescendantAverages\[0\] * firstComponentCount>, 0, 0}{code}

As a fix, could it instead set {{lastRow}} (for MultiRowIterators) to just the
corresponding {{clusteringDescendantAverages}} values?
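
To make the "never a remainder" point concrete, here is a minimal sketch of the arithmetic. It is a simplified model and not the actual cassandra-stress code: the class {{DecomposeSketch}} and the methods {{descendants}} and {{decompose}} are hypothetical names, and it assumes fixed cluster counts of 4, 30 and 10 (the uniform(4..4), uniform(30..30), uniform(10..10) case from the report) with plain integer division.

```java
import java.util.Arrays;

// Simplified model of splitting a flat row position into one index per
// clustering column. Hypothetical names; NOT the real PartitionIterator code.
class DecomposeSketch {

    // descendants[i] = number of rows beneath a single value of column i,
    // i.e. the product of the counts of all deeper clustering columns.
    static long[] descendants(int[] counts) {
        long[] d = new long[counts.length];
        long below = 1;
        for (int i = counts.length - 1; i >= 0; i--) {
            d[i] = below;
            below *= counts[i];
        }
        return d;
    }

    // Mixed-radix split: the quotient is the index for column i, and the
    // remainder is carried down to the deeper columns.
    static long[] decompose(long position, long[] descendants) {
        long[] row = new long[descendants.length];
        for (int i = 0; i < descendants.length; i++) {
            row[i] = position / descendants[i];
            position %= descendants[i];
        }
        return row;
    }

    public static void main(String[] args) {
        int[] counts = {4, 30, 10};        // session_type, event_type, created_at
        long[] d = descendants(counts);    // rows beneath each column value
        long position = counts[0] * d[0];  // an exact multiple of d[0]

        System.out.println(Arrays.toString(d));
        System.out.println(Arrays.toString(decompose(position, d)));
        System.out.println(Arrays.toString(decompose(position - 1, d)));
    }
}
```

Under these assumptions the program prints {{[300, 10, 1]}}, then {{[4, 0, 0]}}, then {{[3, 29, 9]}}. A position that is an exact multiple of the first descendant count leaves zeros in every deeper slot, which is the shape of the {{lastRow}} described above. A position one lower fills the deeper slots. What ends up in the first slot in the real code may differ from this model.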


> cassandra-stress tool - clustering key values not distributed
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-11138
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11138
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>         Environment: Cassandra 2.2.4, Centos 6.5, Java 8
>            Reporter: Ralf Steppacher
>              Labels: stress
>
> I am trying to get the stress tool to generate random values for three 
> clustering keys. I am trying to simulate collecting events per user id (text, 
> partition key). Events have a session type (text), event type (text), and 
> creation time (timestamp) (clustering keys, in that order). For testing 
> purposes I ended up with the following column spec:
> {noformat}
> columnspec:
> - name: created_at
>   cluster: uniform(10..10)
> - name: event_type
>   size: uniform(5..10)
>   population: uniform(1..30)
>   cluster: uniform(1..30)
> - name: session_type
>   size: fixed(5)
>   population: uniform(1..4)
>   cluster: uniform(1..4)
> - name: user_id
>   size: fixed(15)
>   population: uniform(1..1000000)
> - name: message
>   size: uniform(10..100)
>   population: uniform(1..100B)
> {noformat}
> I expected this to create anywhere between 10 and 1200 rows per partition
> key (10 x 1..30 x 1..4). Instead, exactly 10 rows are created, and the
> {{created_at}} timestamp is the only clustering column whose value varies
> within a partition key. The {{session_type}} and {{event_type}} columns are
> each assigned a single fixed value. This is still the case if I set their
> cluster distributions to uniform(30..30) and uniform(4..4) respectively.
> With that setting I expected 1200 rows per partition key, as the stress tool
> announces when it runs, but it still creates 10.
> {noformat}
> [rsteppac@centos bin]$ ./cassandra-stress user 
> profile=../batch_too_large.yaml ops\(insert=1\) -log level=verbose 
> file=~/centos_eventy_patient_session_event_timestamp_insert_only.log -node 
> 10.211.55.8
> …
> Created schema. Sleeping 1s for propagation.
> Generating batches with [1..1] partitions and [1..1] rows (of [1200..1200] 
> total rows in the partitions)
> Improvement over 4 threadCount: 19%
> ...
> {noformat}
> Sample of generated data:
> {noformat}
> cqlsh> select user_id, event_type, session_type, created_at from 
> stresscql.batch_too_large LIMIT 30 ;
> user_id                     | event_type       | session_type | created_at
> -----------------------------+------------------+--------------+--------------------------
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 2012-10-19 
> 08:14:11+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 2004-11-08 
> 04:04:56+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 2002-10-15 
> 00:39:23+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 1999-08-31 
> 19:56:30+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 1999-04-02 
> 20:46:26+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 1990-10-08 
> 03:27:17+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 1984-03-31 
> 23:30:34+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 1975-11-16 
> 02:41:28+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 1970-04-07 
> 07:23:48+0000
>   %\x7f\x03/.d29<i\$u\x114 | Y ?\x1eR|\x13\t| |     P+|u\x0b | 1970-03-08 
> 23:23:04+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 2015-10-12 
> 17:48:51+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 2010-10-28 
> 06:21:13+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 2005-06-28 
> 03:34:41+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 2005-01-29 
> 05:26:21+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 2003-03-27 
> 01:31:24+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 2002-03-29 
> 14:22:43+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 2000-06-15 
> 14:54:29+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 1998-03-08 
> 13:31:54+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 1988-01-21 
> 06:38:40+0000
>      N!\x0eUA7^r7d\x06J<v< |  \x1bm/c/Th\x07U |        E}P^k | 1975-08-03 
> 21:16:47+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 2014-11-23 
> 17:05:45+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 2012-02-23 
> 23:20:54+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 2012-02-19 
> 12:05:15+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 2005-10-17 
> 04:22:45+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 2003-02-24 
> 19:45:06+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 1996-12-18 
> 06:18:31+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 1991-06-10 
> 22:07:45+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 1983-05-05 
> 12:29:09+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 1972-04-17 
> 21:24:52+0000
> oy\x1c0077H"i\x07\x13_%\x06 |    | \nz@Qj\x1cB |        E}P^k | 1971-05-09 
> 23:00:02+0000
> (30 rows)
> cqlsh>
> {noformat}
> If I remove the {{created_at}} clustering key, then the other two clustering 
> keys are being assigned variable values per partition key.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
