[
https://issues.apache.org/jira/browse/CASSANDRA-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309535#comment-14309535
]
Benedict commented on CASSANDRA-8692:
-------------------------------------
On the whole I like the patch, but I still have a lot of suggestions, and think
it should be changed quite a bit before commit.
First, the minor suggestions/nits:
# The bug you spotted some time ago with .size() being passed in to drainTo()
is still present.
# For clarity, make DISABLE_COALESCING a strategy, instead of a parameter?
Fewer conditional statements in the code, and it's likely the optimizer will
eliminate these calls anyway, since the strategy is final and the body will be
a no-op.
# It might be good to prefix the VM options with "cassandra." or similar - we
should decide on a uniform approach to this
# parkLoop probably shouldn't loop again if, on a second iteration, it's
already within a few % of the target sleep time (might explain your
double-sleeping?)
# The haveAllSamples() check seems unnecessary - unless our max coalesce window
is >= 134 millis, we should never use the dummy samples?
It also seems to me that it is never useful to not drain the queue; it's not
like it costs us very much in proportion to anything else, and the current
behaviour is that it is always drained anyway (since we always "block", due to
the bug mentioned in nit #1). We can simplify the code in this method quite a
bit with this assumption.
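To make nit #1 concrete, this is the kind of pitfall I mean (hypothetical names and a made-up MAX_DRAIN bound, not quoting the patch):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DrainToPitfall
{
    // hypothetical bound, not taken from the patch
    private static final int MAX_DRAIN = 128;

    public static void main(String[] args)
    {
        BlockingQueue<String> backlog = new LinkedBlockingQueue<>();
        backlog.add("message");

        List<String> drained = new ArrayList<>();

        // Buggy: 'drained' starts empty, so drained.size() == 0 and drainTo()
        // removes nothing; the caller then always falls through to blocking.
        backlog.drainTo(drained, drained.size());
        System.out.println("buggy drain count: " + drained.size());   // 0

        // Intended: drain whatever is available, up to a fixed bound.
        backlog.drainTo(drained, MAX_DRAIN);
        System.out.println("fixed drain count: " + drained.size());   // 1
    }
}
{code}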
More substantively:
It seems that, by only providing our strategy with the first message we drained
prior to blocking, we're missing some useful information that would help it
make better decisions. IMO, we should:
# Immediately drain the queue, and notify of all sample data points
# Then ask the strategy if we want to sleep; and we should provide it the
current outstanding message count to help make this decision
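Roughly the shape I have in mind (hypothetical interface and names, just to pin down the flow - not the patch as written):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.LockSupport;

// Hypothetical strategy interface: it sees every drained sample up front,
// then decides whether to park given the current outstanding message count.
interface CoalescingStrategy
{
    void notifyOfSamples(List<Long> arrivalTimesNanos);
    long coalesceWindowNanos(int outstandingCount);
}

class CoalescingSender
{
    // arrival timestamps stand in for real messages in this sketch
    private final BlockingQueue<Long> backlog = new LinkedBlockingQueue<>();
    private final CoalescingStrategy strategy;

    CoalescingSender(CoalescingStrategy strategy) { this.strategy = strategy; }

    void runOnce() throws InterruptedException
    {
        List<Long> drained = new ArrayList<>();

        // 1. Always drain everything that is already queued.
        drained.add(backlog.take());
        backlog.drainTo(drained);

        // 2. Tell the strategy about all of the samples we just drained.
        strategy.notifyOfSamples(drained);

        // 3. Only then ask whether waiting for more is worthwhile.
        long parkNanos = strategy.coalesceWindowNanos(drained.size());
        if (parkNanos > 0)
        {
            LockSupport.parkNanos(parkNanos);
            backlog.drainTo(drained); // pick up anything that arrived while parked
        }

        // 4. Write 'drained' to the socket here.
    }
}
{code}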
The current moving average strategy admittedly would not work well with this
setup, because it bases its calculation only on the most recent 16 samples, so
providing it with extra samples upfront most likely won't help, and might even
hurt. This behaviour seems counterintuitive, though: sleeping for some period
based on the guessed arrival time of the next message, when that message is
potentially already in our buffer, strikes me as odd, as does the fact that
providing more data points might make the decision worse. I suspect that the
first and last messages in a batch dominate the behaviour, and that it is
probably similar to only tracking these two and sleeping for some ratio of the
delta.
I would like to propose a _slightly_ more involved strategy that calculates
the moving average over time horizons instead of count horizons. For
simplicity, let's say we start with a linear discretization of time (i.e. split
the horizon into even sized buckets). Let's also assume that the 1s average is
a good barometer. So your 16 buckets would each represent the number of
messages that arrived in the prior 16 intervals of 62.5ms (with the boundary
bucket not utilised until we cross it). The interesting bit is then how we use
this information. Let's also assume a uniform message arrival distribution
(again, this can be improved later, but likely won't yield much that parameter
tweaking can't).
This gives us an average delta between messages of D ns. Let's also say we have
N messages in our buffer already.
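To be explicit about the bucketing, here's a sketch of how D could be derived (16 x 62.5ms buckets over the 1s horizon; timestamps are assumed non-negative and monotonic, and all names are illustrative):
{code:java}
// Hypothetical time-horizon moving average: 16 buckets of 62.5ms covering the
// last second, giving an average inter-message gap D in nanoseconds.
class TimeHorizonMovingAverage
{
    private static final int BUCKETS = 16;
    private static final long BUCKET_NANOS = 62_500_000L; // 62.5ms

    private final int[] counts = new int[BUCKETS];
    private long currentBucket = -1; // absolute bucket index = nanos / BUCKET_NANOS

    void sample(long nowNanos)
    {
        long bucket = nowNanos / BUCKET_NANOS;
        if (bucket != currentBucket)
        {
            // zero every bucket we skipped over (at most all of them)
            long toClear = Math.min(BUCKETS, Math.max(0, bucket - currentBucket));
            for (long b = bucket - toClear + 1; b <= bucket; b++)
                counts[(int) (b % BUCKETS)] = 0;
            currentBucket = bucket;
        }
        counts[(int) (bucket % BUCKETS)]++;
    }

    // Average gap D between messages, in nanos, over the closed buckets only
    // (the in-progress boundary bucket is excluded, as discussed above).
    long averageGapNanos()
    {
        int total = 0;
        for (int i = 0; i < BUCKETS; i++)
            if (i != (int) (currentBucket % BUCKETS))
                total += counts[i];
        if (total == 0)
            return Long.MAX_VALUE; // effectively "don't coalesce"
        return ((BUCKETS - 1) * BUCKET_NANOS) / total;
    }
}
{code}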
I can think of two complementary logics to apply off the top of my head. There
are plenty of other possible variants of this style of approach, especially if
we integrate other data (such as network latencies), but these ones spring to
mind as simple and likely yielding much of the potential.
Logic 1:
We have N messages waiting to send; there's no point delaying these unless we
intend to appreciably increase N. So let's say we want to double N to make
sleeping worthwhile; otherwise we may as well send what we have without delay.
So we calculate the expected time for another N messages to arrive, i.e. N*D,
and sleep for that long ONLY if it is less than the max coalesce interval; if
it's greater, we just send immediately. This multiplier can be configurable,
but 2x seems likely to work pretty well.
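In other words, something as simple as this (method-level sketch; names and signature are illustrative):
{code:java}
// Sketch of Logic 1: only park if we can plausibly reach multiplier * N
// messages within the max coalesce window; otherwise send immediately.
// dNanos is the average inter-message gap D from the moving average.
static long logic1ParkNanos(int n, long dNanos, long maxWindowNanos, double multiplier)
{
    // expected time for another (multiplier - 1) * N messages to arrive
    long extra = (long) ((multiplier - 1) * n * dNanos);
    return extra <= maxWindowNanos ? extra : 0; // 0 means flush now
}
{code}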
Logic 2:
Our average arrival rate says that we should expect these N messages to have
arrived over a timespan of N*D ns; if they actually arrived in less time (by
checking the delta between the first/last), we assume this is a burst of
activity and just flush immediately since the non-uniformity of arrival is in
our favour. If the opposite is true we might be able to estimate the
probability that such a cluster of messages will arrive in the next X ns, but
this requires some further thought.
These can both be applied, with Logic 2 overriding Logic 1 if it says to
short-circuit the waiting, say. We can then follow up with some research into
some more advanced analysis of the data, and integration of other metrics into
the decision.
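Concretely, the combination could look something like this (hypothetical names; the "further thought" branch of Logic 2 is left as a plain fall-through to Logic 1 for now):
{code:java}
// Sketch of Logic 2 overriding Logic 1. 'arrivalNanos' holds the timestamps
// of the N already-drained messages, oldest first; dNanos is D as above.
static long parkNanos(long[] arrivalNanos, long dNanos, long maxWindowNanos)
{
    int n = arrivalNanos.length;
    if (n == 0)
        return 0;

    // Logic 2: the N messages arrived faster than the average rate predicts
    // (actual span < N * D), so treat it as a burst and flush immediately.
    long actualSpan = arrivalNanos[n - 1] - arrivalNanos[0];
    if (actualSpan < n * dNanos)
        return 0;

    // Logic 1: otherwise wait for roughly another N messages, but only if
    // that wait fits inside the max coalesce window.
    long expectedExtra = n * dNanos;
    return expectedExtra <= maxWindowNanos ? expectedExtra : 0;
}
{code}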
Do these descriptions make sense, and sound reasonable to you?
> Coalesce intra-cluster network messages
> ---------------------------------------
>
> Key: CASSANDRA-8692
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8692
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Fix For: 2.1.4
>
> Attachments: batching-benchmark.png
>
>
> While researching CASSANDRA-8457 we found that it is effective and can be
> done without introducing additional latency at low concurrency/throughput.
> The patch from that was used and found to be useful in a real life scenario
> so I propose we implement this in 2.1 in addition to 3.0.
> The change set is a single file and is small enough to be reviewable.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)