[jira] [Commented] (CASSANDRA-10331) Establish and implement canonical bulk reading workload(s)

Stefania (JIRA) Sun, 06 Mar 2016 23:42:07 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182702#comment-15182702
 ]


Stefania commented on CASSANDRA-10331:
--------------------------------------

The cassandra-stress tool has been extended with a new user-defined operation
to perform bulk read queries. Here are the links to the patch and CI results:

|[patch|https://github.com/stef1927/cassandra/commits/10331]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10331-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10331-dtest/]|

Each query sweeps an entire token range using the following CQL syntax:

{code}
SELECT token(pk), a, b, c FROM table WHERE token(pk) > X AND token(pk) <= Y
{code}

Here a, b and c are the columns selected by the user and pk is the partition
key. {{token(pk)}} is added to the selection so that partitions can be counted.
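
As a rough illustration of how such a statement could be assembled (this is not the code in the patch; the function name {{token_range_query}} and its parameters are hypothetical), a minimal Python sketch:

```python
def token_range_query(table, partition_key, columns, start, end):
    """Build a CQL statement that sweeps the token range (start, end].

    Hypothetical helper for illustration only.
    partition_key: list of partition key column names
    columns:       list of columns selected by the user
    """
    pk = ", ".join(partition_key)
    cols = ", ".join(columns)
    # token(pk) is prepended to the selection so partitions can be counted
    return (
        f"SELECT token({pk}), {cols} FROM {table} "
        f"WHERE token({pk}) > {start} AND token({pk}) <= {end}"
    )


print(token_range_query("ks.tbl", ["pk"], ["a", "b", "c"], -100, 100))
```

The lower bound is exclusive and the upper bound inclusive, so adjacent ranges cover every token exactly once.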

Users define a schema and add token range queries as follows:

{code}
token_range_queries:
  all_columns_tr_query:
    columns: '*'  
    page_size: 5000
    split_factor: 1 
{code}

The columns must be specified for each query; the page size and split factor
are optional. Each stress operation retrieves one page of a token range,
according to the page size specified by the user. The split factor can be used
to split token ranges into smaller sub-ranges, which may be useful when vnodes
are not used or to exploit thread parallelism further.
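
To make the effect of the split factor concrete, here is a minimal Python sketch that divides one token range into near-equal sub-ranges. It assumes an even split and does not handle wrap-around ranges; the actual splitting logic in the patch may differ, and {{split_range}} is a hypothetical name.

```python
def split_range(start, end, split_factor):
    """Split (start, end] into up to split_factor contiguous sub-ranges.

    Assumption for illustration: sub-ranges are of near-equal width and
    the range does not wrap around the ring (start < end).
    """
    if split_factor < 1:
        raise ValueError("split_factor must be >= 1")
    if end <= start:
        raise ValueError("expected start < end; wrap-around is not handled")
    width = end - start
    bounds = [start + (width * i) // split_factor for i in range(split_factor + 1)]
    # Drop empty sub-ranges, which appear when the range is narrower
    # than the split factor.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


print(split_range(0, 100, 4))
```

With a split factor of 1 the range is returned unchanged, which matches the value used in the example profile above.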

The test cycles across token ranges, and each thread picks a token range. Two 
threads may download the same range in parallel if there aren't enough ranges 
to cycle through, but they will not share a token range retrieval. Each thread 
will download the token range page by page, with each stress operation 
downloading a single page. Once all pages have been retrieved, the thread moves 
on to the next range. If we run out of ranges we start again with the first 
one. When all iterations have been performed, or the test duration has been 
reached, the test halts.
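
The cycling behaviour described above can be sketched roughly as follows. This is a simplified Python model, not the cassandra-stress implementation: {{TokenRangeCycler}}, {{run_worker}} and the {{pages_in_range}} callback are hypothetical, and the number of pages per range is simulated rather than discovered through paging.

```python
import itertools
import threading


class TokenRangeCycler:
    """Hands out token ranges in order, restarting from the first one
    when the list is exhausted (hypothetical helper)."""

    def __init__(self, ranges):
        self._ranges = itertools.cycle(ranges)
        self._lock = threading.Lock()

    def next_range(self):
        # The lock lets several worker threads share one cycler.
        with self._lock:
            return next(self._ranges)


def run_worker(cycler, pages_in_range, operations, log):
    """Perform `operations` stress operations, one page per operation.

    pages_in_range is an assumed callback returning how many pages a
    range holds; a real client only learns this while paging.
    """
    current, pages_left = None, 0
    for _ in range(operations):
        if pages_left == 0:
            # Previous range fully retrieved: move on to the next one.
            current = cycler.next_range()
            pages_left = max(1, pages_in_range(current))
        log.append(current)  # placeholder for fetching a single page
        pages_left -= 1


log = []
run_worker(TokenRangeCycler([(0, 10), (10, 20)]), lambda r: 2, 5, log)
print(log)
```

Each worker keeps its own paging state, so two workers that receive the same range from the cycler retrieve it independently rather than sharing one retrieval.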

I've created some user-defined schemas in
[this benchmark|https://github.com/stef1927/cstar_bulk_read_test] where the row
size ranges from 100 bytes to 10 MB and the clustering size ranges from 1 to
1000 rows. We can use these schemas as a starting point to measure bulk reading
performance.

> Establish and implement canonical bulk reading workload(s)
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-10331
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10331
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Ariel Weisberg
>            Assignee: Stefania
>             Fix For: 3.x
>
>
> Implement a client, use stress, or extend stress to a bulk reading workload 
> that is indicative of the performance we are trying to improve.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
