[
https://issues.apache.org/jira/browse/CASSANDRA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Bailis updated CASSANDRA-4261:
------------------------------------
Summary: [patch] Support consistency-latency prediction in nodetool (was:
[Patch] Support consistency-latency prediction in nodetool)
> [patch] Support consistency-latency prediction in nodetool
> ----------------------------------------------------------
>
> Key: CASSANDRA-4261
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4261
> Project: Cassandra
> Issue Type: New Feature
> Components: Tools
> Affects Versions: 1.2
> Reporter: Peter Bailis
> Attachments: pbs-nodetool-v1.patch
>
>
> h3. Introduction
> Cassandra supports a variety of replication configurations: ReplicationFactor
> is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting
> {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong
> consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or
> {{THREE}}. What should users choose?
> This patch provides a latency-consistency analysis within {{nodetool}}. Users
> can accurately predict Cassandra's behavior in their production environments
> without interfering with performance.
> What's the probability that we'll read a write t seconds after it completes?
> What about reading one of the last k writes? This patch provides answers via
> {{nodetool predictconsistency}}:
> {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
> \\ \\
> {code:title=Example output|borderStyle=solid}
> //N == ReplicationFactor
> //R == read ConsistencyLevel
> //W == write ConsistencyLevel
> user@test:$ nodetool predictconsistency 3 100 1
> 100ms after a given write, with maximum version staleness of k=1
> N=3, R=1, W=1
> Probability of consistent reads: 0.811700
> Average read latency: 6.896300ms (99.900th %ile 174ms)
> Average write latency: 8.788000ms (99.900th %ile 252ms)
> N=3, R=1, W=2
> Probability of consistent reads: 0.867200
> Average read latency: 6.818200ms (99.900th %ile 152ms)
> Average write latency: 33.226101ms (99.900th %ile 420ms)
> N=3, R=1, W=3
> Probability of consistent reads: 1.000000
> Average read latency: 6.766800ms (99.900th %ile 111ms)
> Average write latency: 153.764999ms (99.900th %ile 969ms)
> N=3, R=2, W=1
> Probability of consistent reads: 0.951500
> Average read latency: 18.065800ms (99.900th %ile 414ms)
> Average write latency: 8.322600ms (99.900th %ile 232ms)
> N=3, R=2, W=2
> Probability of consistent reads: 0.983000
> Average read latency: 18.009001ms (99.900th %ile 387ms)
> Average write latency: 35.797100ms (99.900th %ile 478ms)
> N=3, R=3, W=1
> Probability of consistent reads: 0.993900
> Average read latency: 101.959702ms (99.900th %ile 1094ms)
> Average write latency: 8.518600ms (99.900th %ile 236ms)
> {code}
> h3. Demo
> Here's an example scenario you can run using
> [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:
> {code:borderStyle=solid}
> cd <cassandra-source-dir with patch applied>
> ant
> # turn on consistency logging
> sed -i .bak 's/log_latencies_for_consistency_prediction:
> false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml
> ccm create consistencytest --cassandra-dir=.
> ccm populate -n 5
> ccm start
> # if start fails, you might need to initialize more loopback interfaces
> # e.g., sudo ifconfig lo0 alias 127.0.0.2
> # use stress to get some sample latency data
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read
> bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
> {code}
> h3. What and Why
> We've implemented [Probabilistically Bounded
> Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting
> consistency-latency trade-offs within Cassandra. Our
> [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB
> 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range
> of Dynamo-style data store deployments at places like LinkedIn and Yammer in
> addition to profiling our own Cassandra deployments. In our experience,
> prediction is both accurate and much more lightweight than profiling and
> manually testing each possible replication configuration (especially in
> production!).
> This analysis is important for the many users we've talked to and heard about
> who use "partial quorum" operation (e.g., non-{{QUORUM}}
> {{ConsistencyLevel}}). Should they use CL={{ONE}}? CL={{TWO}}? It likely
> depends on their runtime environment and, short of profiling in production,
> there's no existing way to answer these questions. (Keep in mind, Cassandra
> defaults to CL={{ONE}}, meaning users don't know how stale their data will
> be.)
> We outline limitations of the current approach after describing how it's
> done. We believe that this is a useful feature that can provide guidance and
> fairly accurate estimation for most users.
> h3. Interface
> This patch allows users to perform this prediction in production using
> {{nodetool}}.
> Users enable tracing of latency data by setting
> {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.
> Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies.
> Each latency is 8 bytes, and there are 4 distributions we require, so the
> space overhead is {{32*logged_latencies}} bytes of memory for the predicting
> node.
> {{nodetool predictconsistency}} predicts the latency and consistency for each
> possible {{ConsistencyLevel}} setting (reads and writes) by running
> {{number_trials_for_consistency_prediction}} Monte Carlo trials per
> configuration.
> Users shouldn't have to touch these parameters, and the defaults work well.
> The more latencies they log, the better the predictions will be.
> h3. Implementation
> This patch is fairly lightweight, requiring minimal changes to existing code.
> The high-level overview is that we gather empirical latency distributions
> then perform Monte Carlo analysis using the gathered data.
> h4. Latency Data
> We log latency data in {{service.PBSPredictor}}, recording four relevant
> distributions:
> * *W*: time from when the coordinator sends a mutation to the time that a
> replica begins to serve the new value(s)
> * *A*: time from when a replica accepting a mutation sends an
> * *R*: time from when the coordinator sends a read request to the time that
> the replica performs the read
> * *S*: time from when the replica sends a read response to the time when the
> coordinator receives it
> We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along
> with every message (8 bytes overhead required for millisecond {{long}}). In
> {{net.MessagingService}}, we log the start of every mutation and read, and,
> in {{net.ResponseVerbHandler}}, we log the end of every mutation and read.
> Jonathan Ellis mentioned that
> [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar
> latency tracing, but, as far as we can tell, these latencies aren't in that
> patch. We use an LRU policy to bound the number of latencies we track for
> each distribution.
> h4. Prediction
> When prompted by {{nodetool}}, we call {{service.PBSPredictor.doPrediction}},
> which performs the actual Monte Carlo analysis based on the provided data.
> It's straightforward, and we've commented this analysis pretty well but can
> elaborate more here if required.
> h4. Testing
> We've modified the unit test for {{SerializationsTest}} and provided a new
> unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the
> {{PBSPredictor}} test with {{ant pbs-test}}.
> h4. Overhead
> This patch introduces 8 bytes of overhead per message. We could reduce this
> overhead and add timestamps on-demand, but this would require changing
> {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is
> messy.
> If enabled, consistency tracing requires {{32*logged_latencies}} bytes of
> memory on the node on which tracing is enabled.
> h3. Caveats
> The predictions are conservative, or worst-case, meaning we may predict more
> staleness than in practice in the following ways:
> * We do not account for read repair.
> * We do not account for Merkle tree exchange.
> * Multi-version staleness is particularly conservative.
> * We simulate non-local reads and writes. We assume that the coordinating
> Cassandra node is not itself a replica for a given key.
> The predictions are optimistic in the following ways:
> * We do not predict the impact of node failure.
> * We do not model hinted handoff.
> Predictions are only as good as the collected latencies. Generally, the more
> latencies that are collected, the better, but if the environment or workload
> changes, things might change. Also, we currently don't distinguish between
> column families or value sizes. This is doable, but it adds complexity to the
> interface and possibly more storage overhead.
> We can potentially improve these if there's interest, but this is an area of
> active research.
> ----
> Peter Bailis and Shivaram Venkataraman
> [[email protected]|mailto:[email protected]]
> [[email protected]|mailto:[email protected]]
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira