[ https://issues.apache.org/jira/browse/CASSANDRA-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Cranford updated CASSANDRA-13932:
----------------------------------------
    Summary: Stress write order and seed order should be different  (was: Write order and seed order should be different)

Stress write order and seed order should be different
-----------------------------------------------------

                 Key: CASSANDRA-13932
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13932
             Project: Cassandra
          Issue Type: Bug
          Components: Tools
            Reporter: Daniel Cranford
              Labels: stress
         Attachments: 0001-Initial-implementation-cassandra-3.11.patch, vmtouch-after.txt, vmtouch-before.txt

Read tests get an unrealistic performance boost because they read data from a set of partitions that was written sequentially.

I ran into this while running a timed read test against a large data set (250 million partition keys):
{noformat}cassandra-stress read duration=30m{noformat}
While the test was running, I noticed that one node was performing zero IO after an initial period.

Checking the FS cache with
{noformat}vmtouch -v /path/to/sstables{noformat}
I discovered that each node in the cluster had blocks from only a single SSTable loaded in the FS cache. For the node performing zero IO, the SSTable in question was small enough to fit entirely into the FS cache.

I realized that when a read test is run for a duration or until rate convergence, the default population for the seeds is a gaussian distribution over the first million seeds. Because of the way compaction works, partitions that are written sequentially will (with high probability) always live in the same SSTable. So while the first million seeds generate partition keys that are randomly distributed in the token space, the corresponding partitions will most likely all live in the same SSTable. When that SSTable is small enough to fit into the FS cache, you get unbelievably good results for a read test.
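To put rough numbers on this clustering effect, here is a small sketch. It assumes the gaussian's standard deviation is one sixth of the seed range (so the range spans roughly ±3 sigma); that is an illustrative assumption, not necessarily the exact parameters cassandra-stress uses.

```java
import java.util.Random;

public class GaussianSeedClustering {
    public static void main(String[] args) {
        long min = 1, max = 1_000_000;      // the first million seeds
        double mean = (min + max) / 2.0;
        double stddev = (max - min) / 6.0;  // assumed: range covers +/-3 sigma
        Random rng = new Random(42);

        int samples = 1_000_000;
        long withinOneSigma = 0;
        for (int i = 0; i < samples; i++) {
            double seed = mean + stddev * rng.nextGaussian();
            if (Math.abs(seed - mean) <= stddev) withinOneSigma++;
        }
        // Roughly 68% of the simulated reads land on the ~1/3 of the seed
        // range around the mode -- and because those seeds were written
        // sequentially, they overwhelmingly share a single SSTable.
        System.out.printf("%.1f%% of reads fall on %.1f%% of the seed range%n",
                100.0 * withinOneSigma / samples,
                100.0 * (2 * stddev) / (max - min));
    }
}
```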
Consider that a dataset 4x the size of the FS cache will have almost half of its data in SSTables small enough to fit into the FS cache.

Adjusting the population of seeds used during the read test to cover the entire 250 million seeds used to load the cluster does not fix the problem:
{noformat}cassandra-stress read duration=30m -pop dist=gaussian(1..250M){noformat}
or (same population, larger sample):
{noformat}cassandra-stress read n=250M{noformat}
Any distribution other than the uniform distribution has one or more modes, and the mode(s) of such a distribution cluster reads around a certain seed range, which corresponds to a certain set of sequential writes, which in turn corresponds to (with high probability) a single SSTable.

My patch against cassandra-3.11 fixes this by shuffling the sequence of generated seeds. Each seed value is still generated once and only once. The old behavior of sequential seed generation (i.e. seed(n+1) = seed(n) + 1) may be selected with the no-shuffle flag, e.g.
{noformat}cassandra-stress read duration=30m -pop no-shuffle{noformat}
Results: in [^vmtouch-before.txt] only pages from a single SSTable are present in the FS cache, while in [^vmtouch-after.txt] an equal proportion of all SSTables is present in the FS cache.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
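One way to shuffle a seed sequence while still producing each seed exactly once, as the issue describes, is an affine permutation of the index space: seed(i) = (a*i + b) mod n is a bijection on [0, n) whenever gcd(a, n) = 1. This is a hypothetical sketch of the idea only, not the implementation in the attached patch; `shuffledSeed` and the constants `a` and `b` are illustrative.

```java
import java.util.BitSet;

public class ShuffledSeeds {
    // Affine permutation of the seed index space. With gcd(a, n) == 1 this
    // map is a bijection on [0, n), so every seed is generated exactly once,
    // just in a scrambled order rather than sequentially.
    static long shuffledSeed(long i, long n, long a, long b) {
        return (a * i + b) % n;
    }

    public static void main(String[] args) {
        long n = 1_000_003;          // seed count (illustrative)
        long a = 48271, b = 12345;   // 48271 is prime and does not divide n
        BitSet seen = new BitSet((int) n);
        for (long i = 0; i < n; i++) {
            seen.set((int) shuffledSeed(i, n, a, b));
        }
        // Each seed occurs once and only once -> all n bits are set.
        System.out.println(seen.cardinality() == n);
    }
}
```

Because the map is deterministic, re-running with the same constants reproduces the same scrambled order, so write and read phases can still agree on the seed population while breaking the sequential correlation with SSTable layout.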