[ 
https://issues.apache.org/jira/browse/CASSANDRA-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184688#comment-15184688
 ] 

Stefania commented on CASSANDRA-10331:
--------------------------------------

bq. I like it!

Thank you! :)

bq. Just so I understand. If I insert 100k items and set page size 1k I will do 
one loop around the tr in 100 iterations right (with 1 thread) 

Yes.

bq. with 100 threads all ranges will be split evenly?

I actually changed the implementation and introduced {{TokenRangeIterator}} in 
{{stress.generate}}, see the [latest 
commit|https://github.com/stef1927/cassandra/commit/f6f6b4c74634b7b40b4723da881ef568830406de].
I hadn't realized that each thread creates a new operation instance, so I 
mistakenly assumed they would share the {{pendingRanges}} queue in 
{{TokenRangeOperation}}. Now that the queue has been moved to 
{{TokenRangeIterator}}, they do share it. Each thread takes a token range and 
processes it; once it completes, it takes the next available token range, if 
any. So the load should be split evenly across threads provided there are 
enough token ranges, although a thread will process more ranges than the others 
if its ranges contain fewer partitions. 
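
To illustrate the sharing described above, here is a minimal sketch, not the code from the commit. It assumes a single queue of ranges that every worker thread polls until it is empty. {{RangeSketch}}, {{SharedRangeQueueSketch}} and {{drain}} are hypothetical names invented for this example.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a token range; not the class used in the patch.
final class RangeSketch {
    final long start;
    final long end;

    RangeSketch(long start, long end) {
        this.start = start;
        this.end = end;
    }
}

// A single instance is shared by all worker threads, which is the role the
// comment describes for TokenRangeIterator.
final class SharedRangeQueueSketch {
    private final Queue<RangeSketch> pending = new ConcurrentLinkedQueue<>();

    SharedRangeQueueSketch(List<RangeSketch> ranges) {
        pending.addAll(ranges);
    }

    // Returns the next unprocessed range, or null once every range has been handed out.
    RangeSketch next() {
        return pending.poll();
    }

    // Starts `threads` workers that pull from one shared queue and returns how
    // many ranges each worker ended up processing.
    static int[] drain(List<RangeSketch> ranges, int threads) {
        SharedRangeQueueSketch shared = new SharedRangeQueueSketch(ranges);
        int[] processed = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                while (shared.next() != null) {
                    processed[id]++; // each slot is written by exactly one thread
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // join makes the per-thread counts visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed;
    }
}
```

The per-thread counts need not be equal: a thread that gets cheap ranges comes back to the queue sooner and takes more of them, but every range is handed out exactly once.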

bq. How does split_factor relate to threads? Just allocates more splits for 
threads to work on at once?

Correct: by increasing the number of ranges we increase the opportunity for 
more threads to work in parallel. It is mostly intended for when there are 
fewer vnodes than stress threads. 
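
As a rough illustration of what a split factor does, the sketch below divides one range into contiguous sub-ranges of near-equal width. It is an assumption about the general technique, not the splitting logic used by the patch or the driver. {{RangeSplitSketch}} and {{split}} are hypothetical names, and the range is assumed not to wrap around the ring.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class RangeSplitSketch {
    // Splits (start, end] into at most splitFactor contiguous sub-ranges of
    // near-equal width. Each element of the result is {subStart, subEnd}.
    // Assumes start < end, i.e. the range does not wrap around the ring.
    static List<long[]> split(long start, long end, int splitFactor) {
        if (start >= end || splitFactor < 1) {
            throw new IllegalArgumentException("need start < end and splitFactor >= 1");
        }
        // BigInteger avoids overflow when the range spans most of the long domain.
        BigInteger lo = BigInteger.valueOf(start);
        BigInteger width = BigInteger.valueOf(end).subtract(lo);
        // Never create more sub-ranges than there are tokens in the range.
        BigInteger n = BigInteger.valueOf(splitFactor).min(width);

        List<long[]> result = new ArrayList<>();
        long previous = start;
        for (int i = 1; i <= n.intValue(); i++) {
            long next = lo.add(width.multiply(BigInteger.valueOf(i)).divide(n)).longValue();
            result.add(new long[] { previous, next });
            previous = next;
        }
        return result;
    }
}
```

With 4 ranges and a split factor of 8, for example, up to 32 units of work become available, so 32 stress threads could all be busy at once instead of only 4.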

When working on {{TokenRangeIterator}} it felt more appropriate to move the 
split factor from the query definition into a command line parameter. I 
introduced a new settings class for this, {{SettingsTokenRange}}, which we also 
use to decide if we should wrap token ranges, more below.

bq. If we wanted to simulate fetching all data uniquely ONLY once can we do 
that using -pop no-wrap seq=1..N where N is the insert count?

I've added a wrap parameter to {{SettingsTokenRange}}; it can be specified as 
follows: {{-tokenrange \[wrap\] \[split-factor=n\]}}. By default we no longer 
wrap, so we read the entire table content exactly once unless we specify 
{{-tokenrange wrap}}. This simplifies things because the user no longer needs 
to specify the number of iterations or a duration, and typically we want to 
measure performance in retrieving the entire table content. We could move 
{{split-factor}} to {{SettingsPopulation.SequentialOptions}} in order to re-use 
{{no-wrap}}, but it felt a bit strange to force the user to specify a sequence 
that is otherwise not needed. Also, the description of {{no-wrap}} would be a 
bit confusing (seeds vs token ranges). Further, the default behavior for 
{{SettingsPopulation.wrap}} differs depending on whether we use distribution or 
sequential options, whilst range queries need a consistent default behavior. We 
could introduce an entirely new options group for {{SettingsPopulation}} and 
scrap {{SettingsTokenRange}}, but the name {{wrap}} would preferably have to be 
changed to something else for the reasons just outlined.
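
The difference between wrapping and not wrapping can be sketched as follows. This is an illustrative assumption about the behavior described above, not the implementation in the commit. {{WrappingRangeSourceSketch}} is a hypothetical name and ranges are represented as plain strings for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class WrappingRangeSourceSketch {
    private final List<String> allRanges;
    private final boolean wrap;
    private final Queue<String> pending = new ArrayDeque<>();

    WrappingRangeSourceSketch(List<String> ranges, boolean wrap) {
        this.allRanges = new ArrayList<>(ranges);
        this.wrap = wrap;
        pending.addAll(allRanges);
    }

    // Without wrap: returns each range once, then null, so the run ends after
    // a single pass over the table. With wrap: refills the queue when it runs
    // dry, so the caller must bound the run by iterations or duration instead.
    synchronized String next() {
        String range = pending.poll();
        if (range == null && wrap && !allRanges.isEmpty()) {
            pending.addAll(allRanges);
            range = pending.poll();
        }
        return range;
    }
}
```

With the non-wrapping default, a {{null}} return doubles as the stop signal, which is why no iteration count or duration is needed in that mode.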

bq. In terms of code changes the only thing worth mentioning is you should add 
a check that the same label isn't used in queries AND token_range_queries.

It's done, thank you. I've also renamed {{bulk_read_queries}} to 
{{tokenRangeQueries}} in {{SettingsProfile}} and ensured that it cannot be null.
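
For illustration only, a duplicate-label check of the kind requested could look like the sketch below. {{LabelCheckSketch}} and {{checkDistinctLabels}} are hypothetical names, and the two maps stand in for the parsed {{queries}} and {{token_range_queries}} sections of a profile; the check in the commit may be written differently.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class LabelCheckSketch {
    // Throws if any label is defined in both sections, since results are
    // reported per label and a shared label would be ambiguous.
    static void checkDistinctLabels(Map<String, ?> queries, Map<String, ?> tokenRangeQueries) {
        Set<String> duplicates = new TreeSet<>(queries.keySet());
        duplicates.retainAll(tokenRangeQueries.keySet());
        if (!duplicates.isEmpty()) {
            throw new IllegalArgumentException(
                "Labels used in both queries and token_range_queries: " + duplicates);
        }
    }
}
```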

> Establish and implement canonical bulk reading workload(s)
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-10331
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10331
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Ariel Weisberg
>            Assignee: Stefania
>             Fix For: 3.x
>
>
> Implement a client, use stress, or extend stress to a bulk reading workload 
> that is indicative of the performance we are trying to improve.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
