[ 
https://issues.apache.org/jira/browse/CASSANDRA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-6199:
--------------------------------

    Attachment: ops.write.svg
                ops.read.svg
                old.write.rate.distribution.svg
                old.read.rate.distribution.svg
                new.write.rate.distribution.svg
                new.read.rate.distribution.svg

Attached are a number of graphs comparing the op rates for five like-for-like 
runs of both stress tools. The ops.(read|write).svg graphs both plot ops per 
second (y axis) against time elapsed (x axis), with all five runs overlaid.

Firstly, ops.read.svg demonstrates two things: the new stress is faster and, 
more importantly, it exposes a fairly pathological bug in the old stress 
whereby some threads terminate early, reducing the op rate for some period at 
the end. I artificially induced this behaviour here by creating an unbalanced 
cluster, as I was having surprising difficulty reproducing it well enough to 
produce a convincing graph. I'm not sure what changed in my config to reduce 
its occurrence, but this bug can and would strike randomly, so is best 
eliminated either way. In normal use it would (and did) not produce nice, 
clean, reproducible tails like the ones in this test. These graphs also show 
another bug in the old stress: it overstates the actual op rate by up to 10%.

The ops.write.svg graph is a bit messier, but it is easy to see that the peak 
rate for the new stress is substantially higher; not much else is clear, though.

The distribution graphs are normalised, and help demonstrate that the variance 
of the results, at least for reads, is also lower in the new stress. For reads, 
the peaks are more evenly distributed around the mean. For writes, the 
*adjusted* op rate (which is the op rate minus any global pauses detected by 
stress) is clearly (almost) normally distributed. This isn't useful in and of 
itself, obviously, but it does demonstrate that the pause detection is working 
and that the stderr calculations should be safe, meaning stress can run until 
it can safely guarantee that the average op rate is within a requested 
confidence range.
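
To illustrate the stopping rule described above, here is a minimal sketch of 
running until the standard error of the mean op rate falls within a requested 
fraction of the mean. It is not the code from the patch: the class name 
OpRateStats, the method names, the minimum-sample guard and the sample values 
in main are all hypothetical, and it assumes the per-interval adjusted op rates 
are roughly independent and normally distributed.

```java
// Hypothetical sketch, not the stress tool's implementation.
final class OpRateStats {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean

    // Add one per-interval (adjusted) op rate using Welford's online update.
    void add(double opRate) {
        count++;
        double delta = opRate - mean;
        mean += delta / count;
        m2 += delta * (opRate - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    // Standard error of the mean: sqrt(sample variance / n).
    double stderr() {
        if (count < 2) return Double.POSITIVE_INFINITY;
        double variance = m2 / (count - 1);
        return Math.sqrt(variance / count);
    }

    // True once stderr/mean is at or below the target ratio, after a
    // minimum number of samples (an assumed guard against stopping too early).
    boolean converged(double targetRatio, long minSamples) {
        return count >= minSamples && mean > 0 && stderr() / mean <= targetRatio;
    }

    public static void main(String[] args) {
        // Made-up per-interval op rates, for illustration only.
        double[] rates = {50100, 49800, 50350, 49900, 50050, 49950, 50200, 49750};
        OpRateStats stats = new OpRateStats();
        for (double r : rates) {
            stats.add(r);
            if (stats.converged(0.005, 5)) break; // stop when stderr <= 0.5% of mean
        }
        System.out.printf("mean=%.1f stderr=%.1f samples=%d%n",
                stats.mean(), stats.stderr(), stats.count());
    }
}
```

Using the standard error this way is only as safe as the normality assumption, 
which is why the near-normal shape of the adjusted write distribution matters.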



> Improve Stress Tool
> -------------------
>
>                 Key: CASSANDRA-6199
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6199
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Minor
>         Attachments: new.read.rate.distribution.svg, 
> new.write.rate.distribution.svg, old.read.rate.distribution.svg, 
> old.write.rate.distribution.svg, ops.read.svg, ops.write.svg
>
>
> The stress tool could do with sprucing up. The following is a list of 
> essential improvements and things that would be nice to have.
> Essential:
> - Reduce variability of results, especially start/end tails. Do not trash 
> first/last 10% of readings
> - Reduce contention/overhead in stress to increase overall throughput
> - Short warm-up period, which is ignored for summary (or summarised 
> separately), though prints progress as usual. Potentially automatic detection 
> of rate levelling.
> - Better configurability and defaults for data generation - current column 
> generation populates columns with the same value for every row, which is very 
> easily compressible. Possibly introduce partial random data generator 
> (possibly dictionary-based random data generator)
> Nice to have:
> - Calculate and print stdev and mean
> - Add batched sequential access mode (where a single thread performs 
> batch-size sequential requests before selecting another random key) to test 
> how key proximity affects performance
> - Auto-mode which attempts to establish the maximum throughput rate, by 
> varying the thread count (or otherwise gating the number of parallel 
> requests) for some period, then configures rate limit or thread count to test 
> performance at e.g. 30%, 50%, 70%, 90%, 120%, 150% and unconstrained.
> - Auto-mode could have a target variance ratio for mean throughput and/or 
> latency, and completes a test once this target is hit for x intervals
> - Fix key representation so independent of number of keys (possibly switch to 
> 10 digit hex), and don't use String.format().getBytes() to construct it 
> (expensive)
> Also, remove the skip-key setting, as it is currently ignored. Unless 
> somebody knows the reason for it.
> - Fix latency stats
> - Read/write mode, with configurable recency-of-reads distribution
> - Add new exponential/extreme value distribution for value size, column count 
> and recency-of-reads
> - Support more than 2^31 keys
> - Support multiple concurrent stress inserts via key-offset parameter or 
> similar
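
As an illustration of the key representation item in the quoted list, the 
following is one possible sketch of a fixed-width, 10 digit hex key built 
without String.format().getBytes(). The class and method names are 
hypothetical and this is not taken from the patch. Ten hex digits hold 40 bits, 
so the width stays the same regardless of the key count and leaves room for 
more than 2^31 keys.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the stress tool's implementation.
final class HexKey {
    private static final byte[] HEX =
            "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
    private static final int WIDTH = 10; // 10 hex digits = 40 bits

    // Encode the low 40 bits of key as 10 zero-padded lowercase hex digits.
    static byte[] encode(long key) {
        byte[] out = new byte[WIDTH];
        for (int i = WIDTH - 1; i >= 0; i--) {
            out[i] = HEX[(int) (key & 0xF)]; // lowest nibble goes rightmost
            key >>>= 4;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(new String(encode(255L), StandardCharsets.US_ASCII));
    }
}
```

Writing nibbles straight into a byte array avoids format-string parsing and 
the intermediate String allocation on every operation.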



--
This message was sent by Atlassian JIRA
(v6.1#6144)
