[
https://issues.apache.org/jira/browse/CASSANDRA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456871#comment-13456871
]
Matt Blair commented on CASSANDRA-4261:
---------------------------------------
So now that CASSANDRA-1123 has been resolved, will this get merged in time for
1.2?
> [patch] Support consistency-latency prediction in nodetool
> ----------------------------------------------------------
>
> Key: CASSANDRA-4261
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4261
> Project: Cassandra
> Issue Type: New Feature
> Components: Tools
> Affects Versions: 1.2.0 beta 1
> Reporter: Peter Bailis
> Attachments: demo-pbs-v3.sh, pbs-nodetool-v3.patch
>
>
> h3. Introduction
> Cassandra supports a variety of replication configurations: ReplicationFactor
> is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting
> {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong
> consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or
> {{THREE}}. What should users choose?
> This patch provides a latency-consistency analysis within {{nodetool}}. Users
> can accurately predict Cassandra's behavior in their production environments
> without interfering with performance.
> What's the probability that we'll read a write t seconds after it completes?
> What about reading one of the last k writes? This patch provides answers via
> {{nodetool predictconsistency}}:
> {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
> \\ \\
> {code:title=Example output|borderStyle=solid}
> //N == ReplicationFactor
> //R == read ConsistencyLevel
> //W == write ConsistencyLevel
> user@test:$ nodetool predictconsistency 3 100 1
> Performing consistency prediction
> 100ms after a given write, with maximum version staleness of k=1
> N=3, R=1, W=1
> Probability of consistent reads: 0.678900
> Average read latency: 5.377900ms (99.900th %ile 40ms)
> Average write latency: 36.971298ms (99.900th %ile 294ms)
> N=3, R=1, W=2
> Probability of consistent reads: 0.791600
> Average read latency: 5.372500ms (99.900th %ile 39ms)
> Average write latency: 303.630890ms (99.900th %ile 357ms)
> N=3, R=1, W=3
> Probability of consistent reads: 1.000000
> Average read latency: 5.426600ms (99.900th %ile 42ms)
> Average write latency: 1382.650879ms (99.900th %ile 629ms)
> N=3, R=2, W=1
> Probability of consistent reads: 0.915800
> Average read latency: 11.091000ms (99.900th %ile 348ms)
> Average write latency: 42.663101ms (99.900th %ile 284ms)
> N=3, R=2, W=2
> Probability of consistent reads: 1.000000
> Average read latency: 10.606800ms (99.900th %ile 263ms)
> Average write latency: 310.117615ms (99.900th %ile 335ms)
> N=3, R=3, W=1
> Probability of consistent reads: 1.000000
> Average read latency: 52.657501ms (99.900th %ile 565ms)
> Average write latency: 39.949799ms (99.900th %ile 237ms)
> {code}
> h3. Demo
> Here's an example scenario you can run using
> [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:
> {code:borderStyle=solid}
> cd <cassandra-source-dir with patch applied>
> ant
> ccm create consistencytest --cassandra-dir=.
> ccm populate -n 5
> ccm start
> # if start fails, you might need to initialize more loopback interfaces
> # e.g., sudo ifconfig lo0 alias 127.0.0.2
> # use stress to get some sample latency data
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read
> bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
> {code}
> h3. What and Why
> We've implemented [Probabilistically Bounded
> Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting
> consistency-latency trade-offs within Cassandra. Our
> [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB
> 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range
> of Dynamo-style data store deployments at places like LinkedIn and Yammer in
> addition to profiling our own Cassandra deployments. In our experience,
> prediction is both accurate and much more lightweight than profiling and
> manually testing each possible replication configuration (especially in
> production!).
> This analysis is important for the many users we've talked to and heard about
> who use "partial quorum" operation (e.g., non-{{QUORUM}}
> {{ConsistencyLevel}}). Should they use CL={{ONE}}? CL={{TWO}}? It likely
> depends on their runtime environment and, short of profiling in production,
> there's no existing way to answer these questions. (Keep in mind, Cassandra
> defaults to CL={{ONE}}, meaning users don't know how stale their data will
> be.)
> We outline limitations of the current approach after describing how it's
> done. We believe that this is a useful feature that can provide guidance and
> fairly accurate estimation for most users.
> h3. Interface
> This patch allows users to perform this prediction in production using
> {{nodetool}}.
> Users enable tracing of latency data by calling
> {{enableConsistencyPredictionLogging()}} on the {{PBSPredictorMBean}}.
> Cassandra logs a configurable number of latencies, controllable via JMX
> ({{setMaxLoggedLatenciesForConsistencyPrediction(int maxLogged)}}, default:
> 10000). Each latency is 8 bytes, and we require four distributions, so the
> space overhead is {{32*logged_latencies}} bytes of memory on the predicting
> node.
> {{nodetool predictconsistency}} predicts the latency and consistency for each
> possible {{ConsistencyLevel}} setting (reads and writes) by running
> {{setNumberTrialsForConsistencyPrediction(int numTrials)}} Monte Carlo trials
> per configuration (default: 10000).
> Users shouldn't have to touch these parameters, and the defaults work well.
> The more latencies they log, the better the predictions will be.
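> To illustrate, here's a minimal sketch of flipping these knobs from a
> standalone JMX client. The {{ObjectName}} and endpoint below are assumptions
> for illustration only; check the patch for the MBean's actual registered name:
> {code:title=JMX client sketch|borderStyle=solid}
> import javax.management.MBeanServerConnection;
> import javax.management.ObjectName;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
>
> public class EnablePBSTracing {
>     public static void main(String[] args) throws Exception {
>         // JMX endpoint for the first ccm node in the demo above (port 7100);
>         // adjust host/port for your deployment.
>         JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://127.0.0.1:7100/jmxrmi");
>         try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
>             MBeanServerConnection mbs = connector.getMBeanServerConnection();
>
>             // Assumed ObjectName for the PBSPredictorMBean.
>             ObjectName pbs =
>                 new ObjectName("org.apache.cassandra.service:type=PBSPredictor");
>
>             // Start collecting latency samples for prediction.
>             mbs.invoke(pbs, "enableConsistencyPredictionLogging", null, null);
>
>             // Optionally tune the defaults described above.
>             mbs.invoke(pbs, "setMaxLoggedLatenciesForConsistencyPrediction",
>                        new Object[] { 10000 }, new String[] { "int" });
>             mbs.invoke(pbs, "setNumberTrialsForConsistencyPrediction",
>                        new Object[] { 10000 }, new String[] { "int" });
>         }
>     }
> }
> {code}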
> h3. Implementation
> This patch is fairly lightweight, requiring minimal changes to existing code.
> The high-level overview is that we gather empirical latency distributions and
> then perform Monte Carlo analysis using the gathered data.
> h4. Latency Data
> We log latency data in {{service.PBSPredictor}}, recording four relevant
> distributions:
> * *W*: time from when the coordinator sends a mutation to the time that a
> replica begins to serve the new value(s)
> * *A*: time from when a replica accepting a mutation sends an acknowledgement
> to the time that the coordinator receives it
> * *R*: time from when the coordinator sends a read request to the time that
> the replica performs the read
> * *S*: time from when the replica sends a read response to the time when the
> coordinator receives it
> We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along
> with every message (8 bytes overhead required for millisecond {{long}}). In
> {{net.MessagingService}}, we log the start of every mutation and read, and,
> in {{net.ResponseVerbHandler}}, we log the end of every mutation and read.
> Jonathan Ellis mentioned that
> [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar
> latency tracing, but, as far as we can tell, these latencies aren't in that
> patch. We use an LRU policy to bound the number of latencies we track for
> each distribution.
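> For concreteness, here's a simplified, hypothetical sketch of this
> bookkeeping: one bounded log per distribution, dropping the oldest sample once
> the bound is reached. The actual {{service.PBSPredictor}} code in the patch
> may differ in naming and structure:
> {code:title=Latency log sketch|borderStyle=solid}
> import java.util.ArrayDeque;
> import java.util.Deque;
> import java.util.EnumMap;
> import java.util.Map;
>
> // Simplified sketch of the latency bookkeeping described above.
> public class LatencyLog {
>     // The four empirical distributions: W (mutation send -> replica serves),
>     // A (ack send -> coordinator receives), R (read send -> replica reads),
>     // S (read response send -> coordinator receives).
>     enum Distribution { W, A, R, S }
>
>     private final int maxLogged; // setMaxLoggedLatenciesForConsistencyPrediction
>     private final Map<Distribution, Deque<Long>> latencies =
>         new EnumMap<>(Distribution.class);
>
>     public LatencyLog(int maxLogged) {
>         this.maxLogged = maxLogged;
>         for (Distribution d : Distribution.values())
>             latencies.put(d, new ArrayDeque<Long>(maxLogged));
>     }
>
>     // Record one latency sample (milliseconds), evicting the oldest sample
>     // once full so memory stays at roughly 8 bytes per retained sample.
>     public synchronized void record(Distribution d, long latencyMs) {
>         Deque<Long> log = latencies.get(d);
>         if (log.size() >= maxLogged)
>             log.removeFirst();
>         log.addLast(latencyMs);
>     }
>
>     // Copy out the samples for use by the Monte Carlo analysis.
>     public synchronized long[] snapshot(Distribution d) {
>         return latencies.get(d).stream().mapToLong(Long::longValue).toArray();
>     }
> }
> {code}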
> h4. Prediction
> When prompted by {{nodetool}}, we call {{service.PBSPredictor.doPrediction}},
> which performs the actual Monte Carlo analysis based on the provided data.
> It's straightforward, and we've commented this analysis pretty well but can
> elaborate more here if required.
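> For intuition, here's a heavily simplified sketch of what a single trial can
> look like under the WARS model from the paper, sampling from the empirical
> distributions above (k=1 staleness only). Names and structure are illustrative
> assumptions, not the patch's actual {{doPrediction}} code:
> {code:title=Monte Carlo sketch|borderStyle=solid}
> import java.util.Arrays;
> import java.util.Random;
>
> public class PBSMonteCarlo {
>     private final long[] W, A, R, S;   // empirical samples, milliseconds
>     private final Random rng = new Random();
>
>     public PBSMonteCarlo(long[] w, long[] a, long[] r, long[] s) {
>         W = w; A = a; R = r; S = s;
>     }
>
>     private long sample(long[] dist) { return dist[rng.nextInt(dist.length)]; }
>
>     // Probability that a read at consistency level r, issued t ms after a
>     // write at consistency level w completes, observes that write.
>     public double consistentReadProbability(int n, int r, int w, long t,
>                                             int trials) {
>         int consistent = 0;
>         for (int trial = 0; trial < trials; trial++) {
>             long[] writeArrive = new long[n], ackReturn = new long[n];
>             for (int i = 0; i < n; i++) {
>                 writeArrive[i] = sample(W);                // write reaches replica i
>                 ackReturn[i] = writeArrive[i] + sample(A); // ack reaches coordinator
>             }
>             long[] sortedAcks = ackReturn.clone();
>             Arrays.sort(sortedAcks);
>             long writeCompletes = sortedAcks[w - 1];       // w-th fastest ack
>
>             // Issue the read t ms after the write returns; record each replica's
>             // response time and whether it had the value when the read arrived.
>             long readStart = writeCompletes + t;
>             long[] respond = new long[n];
>             boolean[] sawWrite = new boolean[n];
>             Integer[] order = new Integer[n];
>             for (int i = 0; i < n; i++) {
>                 long readArrive = readStart + sample(R);
>                 respond[i] = readArrive + sample(S);
>                 sawWrite[i] = writeArrive[i] <= readArrive;
>                 order[i] = i;
>             }
>             // The coordinator waits for the r fastest responses; the read is
>             // consistent if any of them came from a replica holding the write.
>             Arrays.sort(order, (x, y) -> Long.compare(respond[x], respond[y]));
>             for (int j = 0; j < r; j++)
>                 if (sawWrite[order[j]]) { consistent++; break; }
>         }
>         return (double) consistent / trials;
>     }
> }
> {code}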
> h4. Testing
> We've modified the unit test for {{SerializationsTest}} and provided a new
> unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the
> {{PBSPredictor}} test with {{ant pbs-test}}.
> h4. Overhead
> This patch introduces 8 bytes of overhead per message. We could reduce this
> overhead and add timestamps on-demand, but this would require changing
> {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is
> messy.
> If enabled, consistency tracing requires {{32*logged_latencies}} bytes of
> memory on the node on which tracing is enabled (roughly 320 KB at the default
> of 10000 logged latencies).
> h3. Caveats
> The predictions are conservative, or worst-case, meaning we may predict more
> staleness than occurs in practice, for the following reasons:
> * We do not account for read repair.
> * We do not account for Merkle tree exchange.
> * Multi-version staleness is particularly conservative.
> The predictions are optimistic in the following ways:
> * We do not predict the impact of node failure.
> * We do not model hinted handoff.
> We simulate non-local reads and writes. We assume that the coordinating
> Cassandra node is not itself a replica for a given key. (See discussion
> below.)
> Predictions are only as good as the collected latencies. Generally, the more
> latencies that are collected, the better, but if the environment or workload
> changes, predictions made from previously collected latencies may no longer
> reflect actual behavior. Also, we currently don't distinguish between
> column families or value sizes. This is doable, but it adds complexity to the
> interface and possibly more storage overhead.
> Finally, for accurate results, we require replicas to have synchronized
> clocks (Cassandra requires this from clients anyway). If clocks are
> skewed/out of sync, this will bias predictions by the magnitude of the skew.
> We can potentially address these limitations if there's interest, but this is
> an area of active research.
> ----
> Peter Bailis and Shivaram Venkataraman
> [[email protected]|mailto:[email protected]]
> [[email protected]|mailto:[email protected]]