[
https://issues.apache.org/jira/browse/CASSANDRA-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182702#comment-15182702
]
Stefania commented on CASSANDRA-10331:
--------------------------------------
The cassandra-stress tool has been extended with a new user-defined operation
to perform bulk read queries. Here are the links to the patch and CI results:
|[patch|https://github.com/stef1927/cassandra/commits/10331]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10331-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10331-dtest/]|
Each query sweeps an entire token range using the following CQL syntax:
{code}
SELECT token(pk), a, b, c FROM table WHERE token(pk) > X AND token(pk) <= Y
{code}
Here a, b and c are the columns selected by the user and pk is the partition
key. {{token(pk)}} is added in order to count partitions.
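To illustrate why {{token(pk)}} is selected, here is a minimal Python sketch of counting partitions in a page of results by watching for changes in the token value. The helper name {{count_partitions}} and the row layout (token as the first element of each tuple) are assumptions for illustration, not code from the patch. It also assumes that rows of the same partition arrive contiguously and that distinct partitions do not share a token.

```python
def count_partitions(rows):
    """Count partitions in a result set ordered by token.

    Hypothetical helper: each row is a tuple whose first element is the
    value of token(pk). A new partition is assumed to start whenever the
    token differs from the token of the previous row.
    """
    count = 0
    previous = None
    for row in rows:
        token = row[0]
        if previous is None or token != previous:
            count += 1
        previous = token
    return count
```

For example, three rows with tokens 1, 1 and 5 would be counted as two partitions.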
Users define a schema and add token range queries as follows:
{code}
token_range_queries:
  all_columns_tr_query:
    columns: '*'
    page_size: 5000
    split_factor: 1
{code}
For each query the columns must be specified. Optionally, the page size and
split factor can be specified. Each stress operation retrieves one page of a
token range according to the page size specified by the user. The split factor
can be used to split token ranges. This may be useful when vnodes are not used
or to exploit thread parallelism further.
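As a rough sketch of what a split factor could mean, the Python function below divides one token range into equal-width sub-ranges that keep the same exclusive-start, inclusive-end convention as the query above. The function name and the equal-width strategy are assumptions for illustration; the patch may split ranges differently, and wrap-around ranges are not handled here.

```python
def split_token_range(start, end, split_factor):
    """Split the range (start, end] into up to split_factor sub-ranges.

    Hypothetical helper: each returned pair (s, e) stands for
    token(pk) > s AND token(pk) <= e. Ranges that wrap around the ring
    (end <= start) are rejected to keep the sketch simple.
    """
    if split_factor < 1:
        raise ValueError("split_factor must be at least 1")
    if end <= start:
        raise ValueError("wrap-around ranges are not handled in this sketch")
    width = end - start
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [start + (width * i) // split_factor
              for i in range(split_factor + 1)]
    # Drop empty sub-ranges that appear when the range is narrower
    # than the split factor.
    return [(bounds[i], bounds[i + 1])
            for i in range(split_factor)
            if bounds[i] < bounds[i + 1]]
```

With a split factor of 1 the range is returned unchanged, which matches the value used in the example configuration.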
The test cycles across token ranges, and each thread picks a token range. Two
threads may download the same range in parallel if there aren't enough ranges
to cycle through, but they will not share a token range retrieval. Each thread
will download the token range page by page, with each stress operation
downloading a single page. Once all pages have been retrieved, the thread moves
on to the next range. If we run out of ranges we start again with the first
one. When all iterations have been performed, or the test duration has been
reached, the test halts.
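The cycling behaviour described above can be sketched in Python as follows. This is a simplified model, not the stress tool's code: the class {{TokenRangeCycler}}, the function {{run_operations}} and the {{pages_in_range}} callback are hypothetical names, and the number of pages per range is supplied up front here, whereas a real run would only learn it when the driver reports that no more pages are available.

```python
import itertools
import threading


class TokenRangeCycler:
    """Hands out token ranges in order, restarting after the last one."""

    def __init__(self, ranges):
        self._ranges = itertools.cycle(ranges)
        self._lock = threading.Lock()

    def next_range(self):
        # The lock lets several threads pick ranges safely. Two threads
        # can end up with the same range once the cycle wraps around,
        # but each keeps its own paging position.
        with self._lock:
            return next(self._ranges)


def run_operations(cycler, pages_in_range, num_operations):
    """Simulate one thread; each operation downloads a single page.

    Returns a list of (token_range, page_index) pairs, one per operation.
    """
    log = []
    current, page = None, 0
    for _ in range(num_operations):
        if current is None:
            current, page = cycler.next_range(), 0
        log.append((current, page))
        page += 1
        if page >= pages_in_range(current):
            # All pages retrieved: move on to the next range.
            current = None
    return log
```

In this model a thread given two ranges of two pages each and five operations reads both pages of the first range, both pages of the second, and then starts again with the first page of the first range.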
I've created some user-defined schemas in [this
benchmark|https://github.com/stef1927/cstar_bulk_read_test] where the row size
ranges from 100 bytes to 10 MB and the clustering size ranges from 1 to 1000
rows. We can use these schemas as a starting point to measure bulk reading
performance.
> Establish and implement canonical bulk reading workload(s)
> ----------------------------------------------------------
>
> Key: CASSANDRA-10331
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10331
> Project: Cassandra
> Issue Type: Sub-task
> Reporter: Ariel Weisberg
> Assignee: Stefania
> Fix For: 3.x
>
>
> Implement a client, use stress, or extend stress to a bulk reading workload
> that is indicative of the performance we are trying to improve.