> I'm trying to understand if this is expected or not, and if there is

Without careful tuning, outliers of a couple of hundred ms are
definitely expected in general (though not necessarily, depending on
workload) as a result of garbage collection pauses. The impact is made
worse if you are running under high CPU load (or even maxing it out
with stress), because after a pause a node that is close to max CPU
usage has little spare capacity and will take considerably longer to
"catch up" on the backlog of queued requests.

Personally, I would just log each response time and feed it to gnuplot
or something. It should be pretty obvious whether or not the latencies
are due to periodic pauses.
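For example (untested sketch; doQuery() is a hypothetical stand-in for
whatever request you're actually issuing, and the output format is just
"seconds-since-start latency-in-ms", one line per request):

    import java.io.FileWriter;
    import java.io.PrintWriter;

    public class LatencyLog {
        public static void main(String[] args) throws Exception {
            long start = System.nanoTime();
            PrintWriter out = new PrintWriter(new FileWriter("latency.log"));
            for (int i = 0; i < 100000; i++) {
                long t0 = System.nanoTime();
                doQuery(); // stand-in for your actual request
                double latencyMs = (System.nanoTime() - t0) / 1e6;
                double elapsedSec = (t0 - start) / 1e9;
                out.printf("%.3f %.3f%n", elapsedSec, latencyMs);
            }
            out.close();
        }

        private static void doQuery() {
            // hypothetical placeholder for the operation being timed
        }
    }

Then something like: plot "latency.log" using 1:2 in gnuplot. Periodic
GC pauses tend to show up as regularly spaced spikes.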

If you are concerned with eliminating or reducing outliers, I would:

(1) Make sure that when you're benchmarking, you're putting Cassandra
under a reasonable amount of load. Latency benchmarks are usually
useless if you're benchmarking against a saturated system. At least,
start by achieving your latency goals at 25% or less CPU usage, and
then go from there if you want to up it.

(2) One can affect GC pauses, but it's non-trivial to eliminate the
problem completely. For example, the length of the frequent young-gen
pauses can typically be decreased by decreasing the size of the young
generation, giving you shorter but more frequent GC pauses. But that
in turn causes more promotion into the old generation, which will
result in more frequent very long pauses (relative to normal; they
would still be infrequent relative to young-gen pauses) - IF your
workload is such that you are suffering from fragmentation and
eventually seeing Cassandra fall back to full compacting GCs
(stop-the-world) for the old generation. (See the flag sketch below
for the relevant knobs.)
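To be concrete, these are the sorts of knobs I mean (example values
only, not recommendations - appropriate sizes depend entirely on your
heap size and workload):

    -Xmn256M                     # young generation size; smaller means
                                 # shorter but more frequent minor pauses
    -XX:SurvivorRatio=8          # eden size relative to each survivor space
    -XX:MaxTenuringThreshold=1   # how many young-gen collections an object
                                 # survives before promotion to old-gen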

I would start by adjusting young gen so that your frequent pauses are
at an acceptable level, and then see whether you can sustain that
without running into problems in the old generation.

Start with this in any case: run Cassandra with -XX:+PrintGC
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps.
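If you're on the standard packaging, the easiest place to put these is
conf/cassandra-env.sh (assuming the usual JVM_OPTS mechanism; adjust
the log path to taste - -Xloggc sends the output to a file instead of
stdout):

    JVM_OPTS="$JVM_OPTS -XX:+PrintGC -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"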

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
