[
https://issues.apache.org/jira/browse/CASSANDRA-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744309#comment-14744309
]
Benedict commented on CASSANDRA-9870:
-------------------------------------
How is this progressing? When do you think we'll have some example graphs to
take a look at?
> Improve cassandra-stress graphing
> ---------------------------------
>
> Key: CASSANDRA-9870
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9870
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Benedict
> Assignee: Shawn Kumar
> Attachments: reads.svg
>
>
> CASSANDRA-7918 introduces graph output from a stress run, but these graphs
> are a little limited. Attached to the ticket is an example of some improved
> graphs which can serve as the *basis* for some improvements, which I will
> briefly describe. They should not be taken as the exact end goal, but we
> should aim for at least their functionality. Preferably with some Javascript
> advantages thrown in, such as the hiding of datasets/graphs for clarity. Any
> ideas for improvements are *definitely* encouraged.
> Some overarching design principles:
> * Display _on *one* screen_ all of the information necessary to get a good
> idea of how two or more branches compare to each other. Ideally we will
> reintroduce this, painting multiple graphs onto one screen, stretched to fit.
> * Axes must be truncated to only the interesting dimensions, to ensure there
> is no wasted space.
> * Each graph displaying multiple kinds of data should use colour _and shape_
> to help easily distinguish the different datasets.
> * Each graph should be tailored to the data it is representing, and we should
> have multiple views of each kind of data.
> The data can roughly be partitioned into three kinds:
> * throughput
> * latency
> * gc
> These can each be viewed in different ways:
> * as a continuous plot of:
> ** raw data
> ** scaled/compared to a "base" branch, or other metric
> ** cumulatively
> * as box plots
> ** ideally, these will plot median, outer quartiles, outer deciles and
> absolute limits of the distribution, so the shape of the data can be best
> understood
> Each view compresses the information differently, losing different detail, so
> that collectively they help us understand the data.
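To make the box-plot view concrete, here is a minimal sketch (not the actual stress tooling) of the summary statistics such a plot would need: the median, outer quartiles, outer deciles, and the absolute limits of the distribution. The nearest-rank percentile method is an assumption for illustration.

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile on an already-sorted list (0 <= p <= 100)."""
    if not sorted_vals:
        raise ValueError("empty dataset")
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def box_summary(samples):
    """The seven numbers a box plot of this shape would display."""
    s = sorted(samples)
    return {
        "min": s[0],
        "p10": percentile(s, 10),
        "p25": percentile(s, 25),
        "median": percentile(s, 50),
        "p75": percentile(s, 75),
        "p90": percentile(s, 90),
        "max": s[-1],
    }
```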
> Some basic rules for presentation that work well:
> * Latency information should be plotted to a logarithmic scale, to avoid high
> latencies drowning out low ones
> * GC information should be plotted cumulatively, to avoid differing
> throughputs giving the impression of worse GC. It should also have a line
> that is rescaled by the amount of work (number of operations) completed
> * Throughput should be plotted as the actual numbers
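The cumulative-GC rule above can be sketched as follows (illustrative only; the interval fields are assumptions, not the real stress output format): each interval carries the GC time observed and the operations completed, and we derive both the running GC total and a second series rescaled by cumulative work, so branches with different throughputs stay comparable.

```python
def gc_series(intervals):
    """intervals: list of (gc_ms, ops_completed) per reporting interval.

    Returns (cumulative GC time, cumulative GC time per operation)."""
    cum_gc, cum_ops = 0.0, 0
    cumulative, per_op = [], []
    for gc_ms, ops in intervals:
        cum_gc += gc_ms
        cum_ops += ops
        cumulative.append(cum_gc)
        # Rescaling by work done avoids a slower branch looking "better"
        # merely because it did fewer operations (and so less GC).
        per_op.append(cum_gc / cum_ops if cum_ops else 0.0)
    return cumulative, per_op
```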
> To walk the graphs top-left to bottom-right, we have:
> * Spot throughput comparison of branches to the baseline branch, as an
> improvement ratio (which can of course be negative, but is not in this
> example)
> * Raw throughput of all branches (no baseline)
> * Raw throughput as a box plot
> * Latency percentiles, compared to baseline. The percentage improvement at
> any point in time vs baseline is calculated, and then multiplied by the
> overall median for the entire run. This simply permits the non-baseline
> branches to scatter their wins/loss around a relatively clustered line for
> each percentile. It's probably the most "dishonest" graph, but comparing
> something like latency where each data point can have very high variance is
> difficult, and this gives you an idea of clustering of improvements/losses.
> * Latency percentiles, raw, each with a different shape; lowest percentiles
> plotted as a solid line as they vary least, with higher percentiles each
> getting their own subtly different shape to scatter.
> * Latency box plots
> * GC time, plotted cumulatively and also scaled by work done
> * GC Mb, plotted cumulatively and also scaled by work done
> * GC time, raw
> * GC time as a box plot
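The baseline-relative latency transform described in the walk-through above can be sketched like this (a hypothetical illustration, not the committed implementation): at each paired point in time, take the branch/baseline ratio and rescale it by the baseline's overall median, so each branch scatters its wins and losses around one clustered line per percentile.

```python
def relative_latency(branch, baseline):
    """branch, baseline: latency samples at paired points in time.

    Returns the branch latencies as baseline-relative values, rescaled by
    the baseline's overall median for the run."""
    # Upper-middle element as the median is good enough for this sketch.
    med = sorted(baseline)[len(baseline) // 2]
    return [b / base * med for b, base in zip(branch, baseline)]
```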
> Most of these graphs introduce the concept of a "baseline" branch. Ideally,
> this baseline would be selected by a dropdown, so the Javascript can
> transform the output dynamically. This would permit more interesting
> comparisons to be made on the fly.
> There are also some complexities, such as deciding which datapoints to
> compare against baseline when times get out-of-whack (due to GC, etc., causing
> a lack of output for a period). The version I uploaded does a merge of the
> times, permitting a small degree of variance, and ignoring those datapoints
> we cannot pair. One option here might be to change stress' behaviour to
> always print to a strict schedule, instead of trying to get absolutely
> accurate apportionment of timings. If this makes things much simpler, it can
> be done.
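The pairing step just described might look something like this sketch (not the uploaded implementation): walk two timestamped series in order, pair points whose times fall within a small tolerance, and ignore the datapoints that cannot be matched.

```python
def pair_samples(a, b, tolerance=0.5):
    """a, b: time-sorted lists of (time, value); returns (va, vb) pairs
    for points whose timestamps agree within `tolerance`."""
    pairs, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ta, va = a[i]
        tb, vb = b[j]
        if abs(ta - tb) <= tolerance:
            pairs.append((va, vb))
            i += 1
            j += 1
        elif ta < tb:
            i += 1  # no partner within tolerance; ignore this datapoint
        else:
            j += 1
    return pairs
```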
> As previously stated (though it may be lost in the wall-of-text), these should
> be taken as a starting point / signpost rather than a golden rule for the end
> goal. But ideally they will be the lower bound of what we can deliver.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)