[ https://issues.apache.org/jira/browse/CASSANDRA-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Cranford updated CASSANDRA-13932:
----------------------------------------
    Summary: Stress write order and seed order should be different  (was: Write order and seed order should be different)

Stress write order and seed order should be different
-----------------------------------------------------

                 Key: CASSANDRA-13932
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13932
             Project: Cassandra
          Issue Type: Bug
          Components: Tools
            Reporter: Daniel Cranford
              Labels: stress
         Attachments: 0001-Initial-implementation-cassandra-3.11.patch, vmtouch-after.txt, vmtouch-before.txt

Read tests get an unrealistic performance boost because they read data from a set of partitions that was written sequentially.

I ran into this while running a timed read test against a large data set (250 million partition keys):
{noformat}cassandra-stress read duration=30m{noformat}
While the test was running, I noticed that one node was performing zero IO after an initial period.

Checking the FS cache with
{noformat}vmtouch -v /path/to/sstables{noformat}
I discovered that each node in the cluster had blocks from only a single SSTable loaded in the FS cache. For the node performing zero IO, the SSTable in question was small enough to fit entirely into the FS cache.

I realized that when a read test is run for a duration or until rate convergence, the default population for the seeds is a gaussian distribution over the first million seeds. Because of the way compaction works, partitions that are written sequentially will (with high probability) always live in the same SSTable. So while the first million seeds generate partition keys that are randomly distributed in the token space, the corresponding partitions will most likely all live in the same SSTable. When that SSTable is small enough to fit into the FS cache, you get unbelievably good results for a read test.
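To put rough numbers on this clustering effect, here is a small sketch. It assumes the gaussian's standard deviation is one sixth of the seed range (so the range spans roughly ±3 sigma); that is an illustrative assumption, not necessarily the exact parameters cassandra-stress uses.

```java
import java.util.Random;

public class GaussianSeedClustering {
    public static void main(String[] args) {
        long min = 1, max = 1_000_000;      // the first million seeds
        double mean = (min + max) / 2.0;
        double stddev = (max - min) / 6.0;  // assumed: range covers +/-3 sigma
        Random rng = new Random(42);

        int samples = 1_000_000;
        long withinOneSigma = 0;
        for (int i = 0; i < samples; i++) {
            double seed = mean + stddev * rng.nextGaussian();
            if (Math.abs(seed - mean) <= stddev) withinOneSigma++;
        }
        // Roughly 68% of the simulated reads land on the ~1/3 of the seed
        // range around the mode -- and because those seeds were written
        // sequentially, they overwhelmingly share a single SSTable.
        System.out.printf("%.1f%% of reads fall on %.1f%% of the seed range%n",
                100.0 * withinOneSigma / samples,
                100.0 * (2 * stddev) / (max - min));
    }
}
```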
Consider that a dataset 4x the size of the FS cache will have almost half of its data in SSTables small enough to fit into the FS cache.

Adjusting the population of seeds used during the read test to cover the entire 250 million seeds used to load the cluster does not fix the problem:
{noformat}cassandra-stress read duration=30m -pop dist=gaussian(1..250M){noformat}
or (same population, larger sample):
{noformat}cassandra-stress read n=250M{noformat}
Any distribution other than the uniform distribution has one or more modes, and the mode(s) of such a distribution cluster reads around a certain seed range, which corresponds to a certain set of sequential writes, which in turn corresponds to (with high probability) a single SSTable.

My patch against cassandra-3.11 fixes this by shuffling the sequence of generated seeds. Each seed value is still generated once and only once. The old behavior of sequential seed generation (i.e. seed(n+1) = seed(n) + 1) may be selected with the no-shuffle flag, e.g.
{noformat}cassandra-stress read duration=30m -pop no-shuffle{noformat}
Results: in [^vmtouch-before.txt] only pages from a single SSTable are present in the FS cache, while in [^vmtouch-after.txt] an equal proportion of all SSTables is present in the FS cache.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
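One way to shuffle a seed sequence while still producing each seed exactly once, as the issue describes, is an affine permutation of the index space: seed(i) = (a*i + b) mod n is a bijection on [0, n) whenever gcd(a, n) = 1. This is a hypothetical sketch of the idea only, not the implementation in the attached patch; `shuffledSeed` and the constants `a` and `b` are illustrative.

```java
import java.util.BitSet;

public class ShuffledSeeds {
    // Affine permutation of the seed index space. With gcd(a, n) == 1 this
    // map is a bijection on [0, n), so every seed is generated exactly once,
    // just in a scrambled order rather than sequentially.
    static long shuffledSeed(long i, long n, long a, long b) {
        return (a * i + b) % n;
    }

    public static void main(String[] args) {
        long n = 1_000_003;          // seed count (illustrative)
        long a = 48271, b = 12345;   // 48271 is prime and does not divide n
        BitSet seen = new BitSet((int) n);
        for (long i = 0; i < n; i++) {
            seen.set((int) shuffledSeed(i, n, a, b));
        }
        // Each seed occurs once and only once -> all n bits are set.
        System.out.println(seen.cardinality() == n);
    }
}
```

Because the map is deterministic, re-running with the same constants reproduces the same scrambled order, so write and read phases can still agree on the seed population while breaking the sequential correlation with SSTable layout.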