[ https://issues.apache.org/jira/browse/CASSANDRA-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310236#comment-14310236 ]

Benedict commented on CASSANDRA-8692:
-------------------------------------

This is the kind of discussion it would really help to have over a whiteboard, 
or failing that over video chat. I'll do my best to explain my position, but 
it's verbose and it seems I already failed to convey it the first time.

bq. The timing of providing messages to the average doesn't matter since the 
timestamp is generated by the submitter not the sending thread

It matters because this moving average is over a fixed number of messages, 
rather than a fixed time interval. So the order of provision wrt querying 
matters a lot, because a (non-uniformly-distributed) close clustering of 
messages will have a smaller than average delta between them. A single batch 
you process at once is likely to fit the bill. The converse is also true: the 
first message in a batch is more likely to be abnormally distant from the prior 
messages. So your implementation does depend on the assumption that message 
arrival is uniform, just implicitly :)
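
To illustrate the difference, here is a toy sketch of the two ways of averaging 
the inter-arrival gap; this is not the actual code, and everything except the 
16-message window and the 1s horizon is invented purely for the example:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only, not the patch under review: contrasts an average of
// inter-arrival gaps over a fixed *count* of messages with one over a fixed
// *time* horizon. Names and structure are made up for the example.
class InterArrivalAverages
{
    // Fixed-count window: the last 16 gaps, however old or recent they are.
    private static final int WINDOW_MESSAGES = 16;
    private final Deque<Long> recentGapsNanos = new ArrayDeque<>();
    private long lastArrivalNanos = -1;

    long averageByCount(long arrivalNanos)
    {
        if (lastArrivalNanos >= 0)
        {
            recentGapsNanos.addLast(arrivalNanos - lastArrivalNanos);
            if (recentGapsNanos.size() > WINDOW_MESSAGES)
                recentGapsNanos.removeFirst();
        }
        lastArrivalNanos = arrivalNanos;

        if (recentGapsNanos.isEmpty())
            return Long.MAX_VALUE; // no data yet
        long sum = 0;
        for (long gap : recentGapsNanos)
            sum += gap;
        // A burst of near-simultaneous arrivals fills the whole window with
        // tiny gaps, so this average collapses even if overall traffic is slow.
        return sum / recentGapsNanos.size();
    }

    // Fixed-time window: however many messages arrived in the last second.
    private static final long HORIZON_NANOS = 1_000_000_000L;
    private final Deque<Long> arrivalsInHorizon = new ArrayDeque<>();

    long averageByTime(long arrivalNanos)
    {
        arrivalsInHorizon.addLast(arrivalNanos);
        while (arrivalsInHorizon.peekFirst() < arrivalNanos - HORIZON_NANOS)
            arrivalsInHorizon.removeFirst();
        // The average gap over the horizon depends only on how many messages
        // fell inside it, not on how they cluster within it.
        return HORIZON_NANOS / Math.max(1, arrivalsInHorizon.size() - 1);
    }
}
{code}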

The main thing is that the non-uniformity of message arrival is highly unlikely 
to average out over 16 messages, whereas over a 1s time horizon it is much more 
likely to, so the time-based assumption should lead to more reasonable 
calculations. With a 150K/s message arrival rate (estimated from your numbers), 
the average is calculated over, on average, ~100us, which is a very short time 
horizon from which to extract any predictive power.
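
For reference, the arithmetic behind that ~100us figure:

{code:java}
// Rough arithmetic behind the ~100us figure (estimated, not measured):
//   ~150,000 messages/s  =>  mean gap ≈ 1s / 150,000 ≈ 6.7us
//   a 16-message window therefore spans ≈ 16 * 6.7us ≈ 107us, i.e. ~100us
{code}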

bq. I am on the fence as to whether something smarter than a moving average 
counts as scope creep.

I'm still only suggesting a moving average, just one measured over a fixed time 
horizon instead of a fixed number of measurements. I'm also suggesting a 
different strategy for utilising this calculation, one that uses more of the 
information at our disposal. The two suggestions are somewhat orthogonal, and 
you could implement each change independently. Right now your algorithm is: 
average over the most recent 16 messages, and wait as long as the average delta 
between those 16, in the hope that this gives us one more message to coalesce. 
Mine is 1) to make the average calculation more robust to fluctuations in 
arrival rates (but still decay rapidly); and 2) to make the decision on whether 
to coalesce based on the amount of potential "win" we will get: as the number 
of messages we have to flush grows, the benefit of waiting declines, and we 
also potentially wait longer if there is the expectation that we can flush a 
significant cluster of messages at once. It should therefore (theoretically) 
lead to less delay when coalescing is unhelpful, and more when it is helpful. 
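
To make (2) a bit more concrete, here is a very rough sketch of the shape of 
rule I have in mind; every name and constant is invented for illustration, and 
it is emphatically not the patch under review:

{code:java}
// Very rough sketch of rule (2); all names and constants are hypothetical.
final class CoalesceDecision
{
    private static final double LATENCY_WEIGHT = 1.0; // how much we value latency vs throughput

    private final long maxWaitNanos;      // hard cap on added latency
    private final double flushCostNanos;  // rough cost of one flush/syscall

    CoalesceDecision(long maxWaitNanos, double flushCostNanos)
    {
        this.maxWaitNanos = maxWaitNanos;
        this.flushCostNanos = flushCostNanos;
    }

    /**
     * @param pending            messages already queued for this connection
     * @param arrivalRatePerNano decayed arrival-rate estimate over a fixed time horizon
     * @return how long to delay the flush, possibly zero
     */
    long waitNanos(int pending, double arrivalRatePerNano)
    {
        if (pending <= 0 || arrivalRatePerNano <= 0)
            return 0;

        double expectedGapNanos = 1.0 / arrivalRatePerNano;

        // Amortisation "win" of picking up one more message before flushing:
        // flushCost/pending now, versus flushCost/(pending + 1) if we wait.
        double winNanos = flushCostNanos / pending - flushCostNanos / (pending + 1);

        // The win shrinks roughly quadratically with the number already pending,
        // so a connection with a big backlog flushes immediately, while a nearly
        // empty one may wait for an expected cluster of arrivals.
        if (winNanos < expectedGapNanos * LATENCY_WEIGHT)
            return 0;
        return Math.min(maxWaitNanos, (long) Math.ceil(expectedGapNanos));
    }
}
{code}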

bq. It seems like trying to do better would help bring down average latency by 
100, maybe 200, microseconds when latency is 2+ milliseconds.

Well, my aim isn't so much to bring down latency further as to make the 
algorithm's behaviour understandable to me. It strikes me that its behaviour 
isn't necessarily emerging from its design, but from other correlated factors. 
Take a random data point in your results for illustration: 400K messages with a 
coalesce window of 100us. Guessing your RF=5 and CL=ALL from your other 
comments, it looks like each node would see 150K/s of traffic, which translates 
to an average gap of ~7us between message arrivals. That means your algorithm 
should, on average, only be waiting ~7us to coalesce, so why is it so much 
faster than a maximum coalesce window of 6us, 12us or 25us? For that to happen 
it must regularly be predicting an average message delay of around 100us, which 
is not correct, so it's not clear to me what's actually driving its results. My 
concern, as a result, is that this might not translate into as much advantage 
in the field as it does in the lab, where we produce a very steady load.
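
Spelling out the back-of-envelope for that data point (the inputs are your 
numbers, the gap is my calculation):

{code:java}
// Back-of-envelope for the 400K / 100us data point:
//   150,000 messages/s per node  =>  mean gap ≈ 1s / 150,000 ≈ 6.7us (~7us)
//   an algorithm predicting ~7us waits should not beat the 6us/12us/25us caps
//   by much, yet the 100us cap does - so the prediction actually in play looks
//   closer to 100us than to 7us.
{code}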

My handwavy suspicion right now is that the non-uniformity of message arrivals 
leads to some kind of phased work completion, where the delta between the 
phases is large enough to tip our calculation towards delaying, which in turn 
leads to phased work provision from the clients, since they behave 
synchronously. This might also explain why it takes a few minutes for the 
cluster to reach its performance steady state, which doesn't otherwise make 
much sense, and which you also said you found confusing.

Either way, fewer unnecessary delays translate into higher throughput as well, 
not just reduced latency. Mostly, though, I want a beast whose properties and 
behaviours I can explain in full.

> Coalesce intra-cluster network messages
> ---------------------------------------
>
>                 Key: CASSANDRA-8692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8692
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.4
>
>         Attachments: batching-benchmark.png
>
>
> While researching CASSANDRA-8457 we found that it is effective and can be 
> done without introducing additional latency at low concurrency/throughput.
> The patch from that was used and found to be useful in a real life scenario 
> so I propose we implement this in 2.1 in addition to 3.0.
> The change set is a single file and is small enough to be reviewable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
