GitHub user jolynch opened a pull request:
https://github.com/apache/cassandra/pull/283
CASSANDRA-14459: DynamicEndpointSnitch should never prefer latent replicas
This change incorporates the feedback from Ariel and Jason as part of
https://issues.apache.org/jira/browse/CASSANDRA-14459.
The following is introduced:
1. A fully pluggable DynamicEndpointSnitch (DES), so that we can continue
experimenting with new implementations.
2. Instead of resetting every 10 minutes, the DES uses active latency
probes for replicas that it was asked to rank but has no recent data on. These
are rate limited by default to a single probe per second. These latency probes,
while not perfect, will correctly detect nodes that are latent due to network
conditions, JVM instability (gc/safepoint pauses), and Read threadpool
exhaustion.
3. A new opt-in implementation of the DES which uses an exponential moving
average (EMA) instead of a Histogram. Both statistical measures try to develop
a noise-reduced sample, with different tradeoffs. The main ones in favor of the
EMA are that it reacts to extreme outliers faster (e.g. if a node is actively
timing out and dropping messages) and generates about 100x less garbage than
the histogram approach.
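To make the EMA idea in item 3 concrete, here is a minimal sketch of a per-replica exponential moving average. It is not the code in this pull request: the class name `EmaLatencySketch`, the `alpha` parameter, and the use of plain strings as replica identifiers are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; names and structure are not taken from the patch.
final class EmaLatencySketch
{
    // Weight given to the newest sample, in (0, 1]. Higher reacts faster, lower smooths more.
    private final double alpha;
    // One running average per replica, keyed here by a plain string identifier.
    private final ConcurrentMap<String, Double> scores = new ConcurrentHashMap<>();

    EmaLatencySketch(double alpha)
    {
        if (!(alpha > 0.0 && alpha <= 1.0))
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        this.alpha = alpha;
    }

    // Fold one latency sample into the replica's average.
    // The first sample for a replica is stored as-is.
    void record(String replica, double latencyMillis)
    {
        scores.merge(replica, latencyMillis,
                     (previous, sample) -> previous + alpha * (sample - previous));
    }

    // Current average for the replica, or NaN if no sample has been recorded.
    double score(String replica)
    {
        return scores.getOrDefault(replica, Double.NaN);
    }
}
```

Each replica needs only one stored number, rather than a reservoir of samples, which is one plausible reason an EMA would allocate less than a histogram.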
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jolynch/cassandra CASSANDRA-14459
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/cassandra/pull/283.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #283
----
commit 850952dac3a7988252cb09072f5dbd226bda3430
Author: Joseph Lynch <joe.e.lynch@...>
Date: 2018-10-01T13:30:58Z
Avoid dropping all data in DynamicSnitch reset
Instead of throwing away all measurements every ten minutes, we now keep
the minimum value and allow "bad" measurements such as EchoMessage
responses to be kept only when the sample size is small (right after a
reset).
This prevents nodes from talking across datacenters and makes it so that
when nodes start up they get a latency landscape during the first round of
gossip.
commit 700f8c2e81221b4b18b6e012cfd33525d4861a91
Author: Joseph Lynch <joe.e.lynch@...>
Date: 2018-07-20T07:08:28Z
Send pings on a scheduled basis rather than from Gossiper
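As a rough sketch of how scheduled probing could respect the default of one probe per second mentioned in the summary, the class below queues replicas that need a probe and hands out at most one per interval. It is an assumption-laden illustration, not the patch's code: `ProbeRateLimiterSketch`, `requestProbe`, and `nextProbe` are hypothetical names, and the caller is assumed to pass in a monotonic clock reading such as `System.nanoTime()`.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; not the scheduling code from the patch.
final class ProbeRateLimiterSketch
{
    private final long minIntervalNanos;
    // Replicas waiting for a probe, in request order, without duplicates.
    private final Set<String> pending = new LinkedHashSet<>();
    private long lastProbeNanos;
    private boolean probedBefore = false;

    ProbeRateLimiterSketch(long minIntervalNanos)
    {
        this.minIntervalNanos = minIntervalNanos;
    }

    // Called when a replica was ranked but has no recent latency data.
    synchronized void requestProbe(String replica)
    {
        pending.add(replica);
    }

    // Called from a periodic task. Returns the replica to probe now,
    // or empty if nothing is pending or the interval has not yet elapsed.
    synchronized Optional<String> nextProbe(long nowNanos)
    {
        if (pending.isEmpty())
            return Optional.empty();
        if (probedBefore && nowNanos - lastProbeNanos < minIntervalNanos)
            return Optional.empty();
        Iterator<String> it = pending.iterator();
        String replica = it.next();
        it.remove();
        lastProbeNanos = nowNanos;
        probedBefore = true;
        return Optional.of(replica);
    }
}
```

A periodic task (for example one submitted to a `ScheduledExecutorService`) could call `nextProbe(System.nanoTime())` and send a ping to whichever replica is returned.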
commit c6760e63b3682b00d11b0a8019cc9b7fda8b199f
Author: Joseph Lynch <joe.e.lynch@...>
Date: 2018-10-11T19:26:44Z
Makes the DES pluggable and refactors it to be cleaner
In particular, this separates the DES components that manage updating the
scores from all the rest, allowing us to experiment safely with e.g.
EMAs instead of Histograms and other new approaches.
commit bb34644ef46d14332ca4f5fa561bf8411eab148f
Author: Joseph Lynch <joe.e.lynch@...>
Date: 2018-10-12T23:13:29Z
Add pluggable EMA based Snitch
Also refactors the test suite to test both implementations as well as
more closely testing the latency probe algorithm.
commit 753e4b86bde34194a5997c84046a1ceb67455337
Author: Joseph Lynch <joe.e.lynch@...>
Date: 2018-10-14T20:39:16Z
Make the DES more testable and benchmark the EMA vs Histogram approach
Using -prof gc I was able to show that the EMA approach is about 4-5x
faster and generates 70-400x less garbage. Essentially the EMA
reacts a little bit slower than the histogram, but is more tolerant of
noise and is generally far more performant.
----