[
https://issues.apache.org/jira/browse/CASSANDRA-11853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297538#comment-15297538
]
T Jake Luciani edited comment on CASSANDRA-11853 at 5/24/16 2:15 AM:
---------------------------------------------------------------------
I'm re-running with multiple settings to see how it changes.
Looking at the code, my main concern is the UniformRateLimiter.
If I understand the code correctly, UniformRateLimiter linearly scales the
ops/sec from the time the limiter was constructed, so when an operation is
ready to run it gets its expected start time based on its absolute operation
number.
I see two problems with this:
* The rate limiter is created at startup and doesn't account for
warmup/HotSpot etc., so once warmed up the ops are already behind schedule.
This explains the
[initial latency
spike|http://cstar.datastax.com/graph?command=one_job&stats=022678d8-2123-11e6-bcd7-0256e416528f&metric=99th_latency&operation=3_read&smoothing=1&show_aggregates=true&xmin=0&xmax=549.67&ymin=0&ymax=318.01]
in the run, which skews the overall results. The limiter start time should
only be set once the actual measured ops are ready to start.
* If the rate limit is set too high, such that stress can't keep up with the
expected rate, the results will make no sense: the actual start time will be
well after the limiter's calculated start time.
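To make the two points above concrete, here is a minimal sketch of the scheduling as I understand it. This is not the UniformRateLimiter from the branch: the class name ScheduledRateSketch and the methods start, intendedStartNanos and lagNanos are hypothetical names chosen for illustration. It assumes operation i is scheduled at start + i * interval, sets the start time explicitly after warmup instead of in the constructor, and exposes how far behind schedule an operation actually began.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch, not the UniformRateLimiter from the branch.
 * Operation i is scheduled at startNanos + i * intervalNanos.
 */
final class ScheduledRateSketch
{
    private final long intervalNanos;
    private long startNanos;
    private boolean started;

    ScheduledRateSketch(long opsPerSecond)
    {
        if (opsPerSecond <= 0)
            throw new IllegalArgumentException("opsPerSecond must be positive");
        this.intervalNanos = TimeUnit.SECONDS.toNanos(1) / opsPerSecond;
    }

    /** Call once warmup is finished, just before the measured operations begin. */
    void start(long nowNanos)
    {
        startNanos = nowNanos;
        started = true;
    }

    /** Intended start time of the opIndex-th measured operation (0-based). */
    long intendedStartNanos(long opIndex)
    {
        if (!started)
            throw new IllegalStateException("start() has not been called");
        return startNanos + opIndex * intervalNanos;
    }

    /** How far behind schedule an operation actually started; 0 if on time or early. */
    long lagNanos(long opIndex, long actualStartNanos)
    {
        return Math.max(0L, actualStartNanos - intendedStartNanos(opIndex));
    }
}
```

A caller would pass System.nanoTime() to start() after warmup. If lagNanos keeps growing across operations, the configured rate is higher than stress can sustain, and a warning at that point would tell the user the results should not be trusted.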
It would be very good if we could add some way of detecting local GC pauses,
like we do in
[GCInspector|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/GCInspector.java];
otherwise users have no way of knowing whether the latency is due to local
pauses or server pauses.
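As a rough illustration of what local detection could look like, the sketch below polls the standard java.lang.management GC beans and reports how many collections ran, and how much collection time accumulated, since the previous sample. This is a simpler polling approach and not GCInspector's code. The class name LocalGcSketch and its sample method are hypothetical. Note that the collection time reported by the JVM is approximate and, for concurrent collectors, is not purely pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sketch: tracks GC activity in the local (stress) JVM between samples. */
final class LocalGcSketch
{
    private long lastCount;
    private long lastTimeMillis;

    LocalGcSketch()
    {
        lastCount = totalCount();
        lastTimeMillis = totalTimeMillis();
    }

    private static long totalCount()
    {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            sum += Math.max(0L, gc.getCollectionCount()); // -1 means "undefined"
        return sum;
    }

    private static long totalTimeMillis()
    {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            sum += Math.max(0L, gc.getCollectionTime()); // -1 means "undefined"
        return sum;
    }

    /** Returns {collections, approximate collection millis} since the previous call. */
    long[] sample()
    {
        long count = totalCount();
        long time = totalTimeMillis();
        long[] delta = { count - lastCount, time - lastTimeMillis };
        lastCount = count;
        lastTimeMillis = time;
        return delta;
    }
}
```

Calling sample() once per reporting interval and printing the result next to the latency figures would let a user see whether a latency spike coincides with GC activity on the client side.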
General comments/nits on the branch:
* The [code style|https://wiki.apache.org/cassandra/CodeStyle] needs to be
fixed (break on bracket etc)
* HdrHistogram needs to also be added to the build.xml maven/pom dependencies
* Comments on top-level classes like UniformRateLimiter would be helpful for
future readers.
> Improve Cassandra-Stress latency measurement
> --------------------------------------------
>
> Key: CASSANDRA-11853
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11853
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Nitsan Wakart
> Assignee: Nitsan Wakart
> Fix For: 3.x
>
>
> Currently CS reports latency using a sampling latency container and reporting
> service time (as opposed to response time from intended schedule) leading to
> coordinated omission.
> Fixed here:
> https://github.com/nitsanw/cassandra/tree/co-correction