[ https://issues.apache.org/jira/browse/CASSANDRA-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309535#comment-14309535 ]

Benedict commented on CASSANDRA-8692:
-------------------------------------

On the whole I like the patch, but I still have a lot of suggestions, and think 
it should be changed quite a bit before commit.

First, minor suggestions/nits:
# The bug you spotted some time ago with .size() being passed in to drainTo() 
is still present.
# For clarity, make DISABLE_COALESCING a strategy, instead of a parameter? 
This means fewer conditional statements in the code, and it's likely the 
optimizer will eliminate these calls anyway, since the strategy is final and 
the body will be a no-op.
# VM options might be good to prefix with "cassandra." or similar - we should 
decide on a uniform approach to this
# parkLoop probably shouldn't loop again once it is already within a few % of 
the target sleep time (this might explain your double-sleeping?)
# The haveAllSamples() check seems unnecessary - unless our max coalesce window 
is >= 134 millis, we should never use the dummy samples?

It also seems to me that it is never useful to not drain the queue; it costs 
us very little in proportion to anything else, and the current behaviour is 
that the queue is always drained anyway (since we always "block", due to the 
bug mentioned in nit #1). We can simplify the code in this method quite a bit 
with this assumption.
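For illustration, a minimal sketch of that always-drain shape; the names here 
(backlog, QueuedMessage, MAX_DRAIN) are illustrative, not the patch's actual 
identifiers:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

static final int MAX_DRAIN = 128; // illustrative fixed bound

// Sketch only: always drain before consulting the strategy.
List<QueuedMessage> drain(BlockingQueue<QueuedMessage> backlog) throws InterruptedException
{
    List<QueuedMessage> drained = new ArrayList<>(MAX_DRAIN);
    // per nit #1: the bound must be a real constant; passing the drain
    // target's .size() (initially 0) drains nothing, so we always block
    backlog.drainTo(drained, MAX_DRAIN);
    if (drained.isEmpty())
        drained.add(backlog.take()); // queue empty: block for the next message
    return drained;
}
{code}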

More substantively:

By only providing our strategy with the first message we drained prior to 
blocking, it seems we're missing useful information that would help us make 
better decisions. IMO, we should:

# Immediately drain the queue, and notify of all sample data points
# Then ask the strategy if we want to sleep; we should provide it the 
current outstanding message count to help make this decision (see the 
interface sketch after this list)
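To make that flow concrete, a hypothetical interface sketch (the method names 
are mine, not the patch's):

{code:java}
import java.util.List;

// Sketch only: notify the strategy of all drained samples first,
// then ask it how long to sleep given the outstanding message count.
interface CoalescingStrategy
{
    // called once per drained batch, with the arrival time of each message
    void newArrivals(List<Long> arrivalNanos);

    // how long to park before flushing, given how many messages are
    // already buffered; <= 0 means flush immediately
    long sleepNanos(int outstandingMessages);
}
{code}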

The current moving average strategy admittedly would not work well with this 
setup, because it bases its calculation only on the most recent 16 samples, 
so providing it with extra samples upfront most likely won't help, and might 
even hurt. Behaviour-wise that seems counterintuitive, though: sleeping for 
some period based on the guessed arrival time of the next message, when that 
message is potentially already in our buffer, strikes me as odd, as does the 
idea that providing more data points might make the decision worse. I suspect 
that the first and last messages in a batch dominate the behaviour, and that 
this strategy is probably similar to only tracking those two and sleeping for 
some ratio of their delta.

I would like to propose a _slightly_ more involved strategy, that calculates 
the moving average over time horizons instead of count horizons. For 
simplicity, let's say we start with a linear discretization of time (i.e. split 
the horizon into even sized buckets). Let's also assume that the 1s average is 
a good barometer. So your 16 buckets would each represent the number of 
messages that arrived in the prior 16 intervals of 62.5ms (with the boundary 
bucket not utilised until we cross it). The interesting bit is then how we use 
this information. Let's also assume a uniform message arrival distribution 
(again, this can be improved later, but likely won't yield much that parameter 
tweaking can't).

This gives us an average delta between messages of D ns. Let's also say we 
have N messages in our buffer already.
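A rough sketch of how that discretization might look (illustrative names 
again, and assuming System.nanoTime() timestamps):

{code:java}
import java.util.concurrent.TimeUnit;

// Sketch only: arrival counts over a 1s horizon, discretized into 16
// buckets of 62.5ms; the in-progress boundary bucket is excluded from
// the average until we cross into the next one.
final class TimeHorizonAverage
{
    private static final int BUCKETS = 16;
    private static final long BUCKET_NANOS = TimeUnit.SECONDS.toNanos(1L) / BUCKETS; // 62.5ms

    private final int[] counts = new int[BUCKETS];
    private int current;                          // index of the in-progress bucket
    private long currentStart = System.nanoTime();

    void sample(long nowNanos)
    {
        rotate(nowNanos);
        counts[current]++;
    }

    // average inter-message delta D, in nanos, over the completed buckets
    long averageDeltaNanos(long nowNanos)
    {
        rotate(nowNanos);
        int messages = 0;
        for (int i = 0; i < BUCKETS; i++)
            if (i != current)
                messages += counts[i];
        return messages == 0 ? Long.MAX_VALUE
                             : (BUCKETS - 1) * BUCKET_NANOS / messages;
    }

    // zero out any buckets we have skipped past since the last call
    private void rotate(long nowNanos)
    {
        while (nowNanos - currentStart >= BUCKET_NANOS)
        {
            current = (current + 1) % BUCKETS;
            counts[current] = 0;
            currentStart += BUCKET_NANOS;
        }
    }
}
{code}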

I can think of two complementary logics to apply off the top of my head. There 
are plenty of other possible variants of this style of approach, especially if 
we integrate other data (such as network latencies), but these two spring to 
mind as simple and likely to yield much of the potential benefit.

Logic 1:
We have N messages waiting to send; there's no point delaying these unless we 
intend to appreciably increase N. So let's say we want to double the number of 
messages to make it worthwhile sleeping; otherwise we may as well send what we 
have without delay. Doubling N means waiting for N more arrivals, i.e. N*D ns, 
so we calculate that and sleep for it ONLY if it is less than the max coalesce 
interval; if it's greater, we just send immediately. This multiplier can be 
configurable, but 2x seems likely to work pretty well.
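A minimal sketch of Logic 1, with the doubling case spelled out 
(maxCoalesceNanos stands in for the configured cap; all names illustrative):

{code:java}
// Sketch only: sleep only if we expect to multiply the batch size within
// the max coalesce window. To reach multiplier M from N buffered messages
// we need (M - 1) * N more arrivals, i.e. a wait of (M - 1) * N * D nanos.
long coalesceSleepNanos(int n, long dNanos, double multiplier, long maxCoalesceNanos)
{
    long wait = (long) ((multiplier - 1) * n * dNanos);
    // if doubling (M = 2) would take longer than we're willing to wait,
    // send immediately rather than pay latency for little gain
    return wait < maxCoalesceNanos ? wait : 0;
}
{code}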

Logic 2:
Our average arrival rate says that we should expect these N messages to have 
arrived over a timespan of N*D ns; if they actually arrived in less time (by 
checking the delta between the first/last), we assume this is a burst of 
activity and just flush immediately, since the non-uniformity of arrival is in 
our favour. If the opposite is true, we might be able to estimate the 
probability that such a cluster of messages will arrive in the next X ns, but 
this requires some further thought.
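And a matching sketch of the Logic 2 burst check (illustrative names again):

{code:java}
// Sketch only: if the N buffered messages arrived in less time than the
// average rate predicts, treat it as a burst and flush immediately.
boolean isBurst(int n, long dNanos, long firstArrivalNanos, long lastArrivalNanos)
{
    long actualSpanNanos = lastArrivalNanos - firstArrivalNanos;
    long expectedSpanNanos = (long) n * dNanos; // uniform-arrival expectation
    return actualSpanNanos < expectedSpanNanos;
}
{code}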

These can both be applied, with Logic 2 overriding Logic 1 if it says to 
short-circuit the waiting. We can then follow up with research into more 
advanced analysis of the data, and integration of other metrics into the 
decision.
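Composed, the decision might look like this (same hypothetical helpers as in 
the sketches above):

{code:java}
// Sketch only: Logic 2 short-circuits Logic 1.
long decideSleepNanos(int n, long dNanos, long firstNanos, long lastNanos, long maxCoalesceNanos)
{
    if (isBurst(n, dNanos, firstNanos, lastNanos))
        return 0; // burst in progress: flush now
    return coalesceSleepNanos(n, dNanos, 2.0, maxCoalesceNanos);
}
{code}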

Do these descriptions make sense, and sound reasonable to you?

> Coalesce intra-cluster network messages
> ---------------------------------------
>
>                 Key: CASSANDRA-8692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8692
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.4
>
>         Attachments: batching-benchmark.png
>
>
> While researching CASSANDRA-8457 we found that it is effective and can be 
> done without introducing additional latency at low concurrency/throughput.
> The patch from that was used and found to be useful in a real life scenario 
> so I propose we implement this in 2.1 in addition to 3.0.
> The change set is a single file and is small enough to be reviewable.


