[
https://issues.apache.org/jira/browse/CASSANDRA-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310236#comment-14310236
]
Benedict commented on CASSANDRA-8692:
-------------------------------------
This is the kind of discussion it would really help to have over a whiteboard,
or failing that over video chat. I'll do my best to explain my position,
but it's verbose, and it seems I already failed to convey it the first time.
bq. The timing of providing messages to the average doesn't matter since the
timestamp is generated by the submitter not the sending thread
It matters because this moving average is taken over a fixed number of messages,
rather than a fixed time interval. So the order of provision with respect to
querying matters a lot, because a (non-uniformly-distributed) close cluster of
messages will have a smaller-than-average delta between them. A single batch you
process at once is likely to fit the bill. The converse is also true: the
first message in a batch is more likely to be abnormally distant from the prior
messages. So your implementation depends on the assumption that message
arrival is uniform, just implicitly :)
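To make that clustering effect concrete, here's a minimal hypothetical sketch (not the code in the patch; the class and the numbers are made up for illustration) of a last-16 average being fed a tight batch:
{code:java}
import java.util.ArrayDeque;

// Illustrative only: a moving average over the timestamps of the last 16
// messages. A tight batch of near-identical submitter timestamps evicts the
// older, representative samples and collapses the average.
final class CountBasedAverage
{
    private static final int WINDOW = 16;
    private final ArrayDeque<Long> timestamps = new ArrayDeque<>();

    void record(long timestampNanos)
    {
        timestamps.addLast(timestampNanos);
        if (timestamps.size() > WINDOW)
            timestamps.removeFirst();
    }

    // average delta between the retained timestamps
    long averageDeltaNanos()
    {
        if (timestamps.size() < 2)
            return Long.MAX_VALUE;
        return (timestamps.peekLast() - timestamps.peekFirst()) / (timestamps.size() - 1);
    }

    public static void main(String[] args)
    {
        CountBasedAverage avg = new CountBasedAverage();
        long t = 0;
        // steady arrivals every 100us
        for (int i = 0; i < WINDOW; i++)
            avg.record(t += 100_000);
        System.out.println(avg.averageDeltaNanos()); // ~100,000ns
        // one tight batch, stamped ~60ns apart, replaces the whole window
        for (int i = 0; i < WINDOW; i++)
            avg.record(t += 60);
        System.out.println(avg.averageDeltaNanos()); // ~60ns, though the offered load didn't change
    }
}
{code}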
The main thing is that the non-uniformity of message arrival is highly unlikely
to average out over 16 messages, whereas over a 1s time horizon it is more
likely to, so that assumption should lead to more reasonable calculations.
With a 150K/s message arrival rate (estimated from your numbers), the average
is being calculated over, on average, 100us, which is a very short time horizon
from which to extract any predictive power.
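Spelling out that arithmetic (taking the ~150K/s estimate as given):
{noformat}
\frac{16\ \text{messages}}{150{,}000\ \text{messages/s}} \approx 107\,\mu s \approx 100\,\mu s
{noformat}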
bq. I am on the fence as to whether something smarter than a moving average
counts as scope creep.
I'm still only suggesting a moving average, just one measured over a fixed time
horizon instead of a fixed number of measurements. I'm also suggesting a
different strategy for utilising this calculation, one that uses more of the
information at our disposal. The two suggestions are somewhat orthogonal, and
you could implement each change independently. Right now your algorithm is:
average the deltas over the most recent 16 messages, and wait for as long as
that average delta, in the hope of coalescing one more message. Mine
is 1) to make the average calculation more robust to fluctuations in arrival
rate (but still decay rapidly); and 2) to make the decision on whether to
coalesce based on the amount of potential "win" we will get: as the number of
messages we already have to flush grows, the benefit of waiting declines, and
conversely we potentially wait longer if there is an expectation we can flush a
significant cluster of messages at once. Thus it should (theoretically) lead to
less delay when delay is unhelpful, and more when it is helpful.
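To be clearer about what I mean, here's a minimal sketch of one way those two pieces could fit together. It's purely illustrative - the names, the decay horizon and the shape of the "win" heuristic are assumptions I've made for the example, not a concrete proposal, and it doesn't attempt the "wait longer for a significant cluster" part:
{code:java}
// Illustrative sketch only: (1) an arrival-rate estimate decayed over a fixed
// time horizon rather than over the last 16 samples; (2) a coalesce decision
// that weighs the predicted gap against how much we already stand to flush.
final class CoalescingSketch
{
    private static final double HORIZON_NANOS = 1_000_000_000d; // ~1s decay horizon (assumption)

    private double ratePerNano = 0d;  // decayed estimate of messages per nanosecond
    private long lastArrivalNanos = -1;

    // (1) update the time-decayed arrival-rate estimate on each message
    void onMessage(long nowNanos)
    {
        if (lastArrivalNanos >= 0)
        {
            long gap = Math.max(1, nowNanos - lastArrivalNanos);
            double alpha = Math.min(1d, gap / HORIZON_NANOS);
            ratePerNano = ratePerNano == 0d
                        ? 1d / gap
                        : (1d - alpha) * ratePerNano + alpha * (1d / gap);
        }
        lastArrivalNanos = nowNanos;
    }

    // (2) how long to wait before flushing: the more messages already pending,
    // the smaller the marginal win of delaying for one more
    long coalesceDelayNanos(int pendingMessages, long maxDelayNanos)
    {
        if (ratePerNano <= 0d)
            return 0;
        long expectedGapNanos = (long) (1d / ratePerNano);
        long budgetNanos = maxDelayNanos / (pendingMessages + 1);
        return expectedGapNanos <= budgetNanos ? expectedGapNanos : 0;
    }

    public static void main(String[] args)
    {
        CoalescingSketch sketch = new CoalescingSketch();
        long t = 0;
        for (int i = 0; i < 1000; i++)
            sketch.onMessage(t += 7_000); // steady ~7us gaps
        System.out.println(sketch.coalesceDelayNanos(1, 100_000));  // ~7000ns: a short wait is worth it
        System.out.println(sketch.coalesceDelayNanos(20, 100_000)); // 0: flush now, the batch is already big
    }
}
{code}
The exact gain function is very much up for debate; the point is only that the wait should depend on both the predicted gap and how much we already stand to flush.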
bq. It seems like trying to do better would help bring down average latency by
100, maybe 200, microseconds when latency is 2+ milliseconds.
Well, my aim isn't so much to bring down latency further as to make the
algorithm understandable to me. It strikes me that its behaviour isn't
necessarily emerging from its design, but from other correlated factors. Take a
random data point from your results for illustration: 400K messages/s with a
coalesce window of 100us. Guessing RF=5 and CL=ALL from your other comments, it
looks like each node would see 150K/s of traffic, which translates to a ~7us
average gap between message arrivals. That means your algorithm should, on
average, only be waiting ~7us to coalesce; so why is it so much faster than a
maximum coalesce window of 6us, 12us or 25us? That suggests it's regularly
predicting an average message delay of around 100us, which is not correct, so
it's not clear to me what's actually driving its results. My concern, as a
result, is that this might not translate into as much advantage in the field as
it does in the lab, where we produce a very steady load.
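For reference, the inter-arrival arithmetic I'm relying on there (again taking ~150K/s per node as given):
{noformat}
\frac{1}{150{,}000\ \text{messages/s}} \approx 6.7\,\mu s \ll 100\,\mu s
{noformat}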
My handwavy suspicion right now is that the non-uniformity of message arrivals
leads us into some kind of phased work completion, where the delta between the
phases is large enough to tip our calculation towards delaying, which in turn
steadily leads to phased work provision from the clients, since they behave
synchronously. This might also explain why it takes a few minutes for the
cluster to reach its performance steady state - something that doesn't
otherwise make much sense, and which you also flagged as confusing.
Either way, fewer unnecessary delays also translate into higher throughput, not
just reduced latency. Mostly, though, I want a beast all of whose properties
and behaviours I can explain.
> Coalesce intra-cluster network messages
> ---------------------------------------
>
> Key: CASSANDRA-8692
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8692
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Fix For: 2.1.4
>
> Attachments: batching-benchmark.png
>
>
> While researching CASSANDRA-8457 we found that coalescing is effective and can
> be done without introducing additional latency at low concurrency/throughput.
> The patch from that ticket was used and found useful in a real-life scenario,
> so I propose we implement this in 2.1 in addition to 3.0.
> The change set is a single file and is small enough to be reviewable.