[ 
https://issues.apache.org/jira/browse/CASSANDRA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456871#comment-13456871
 ] 

Matt Blair commented on CASSANDRA-4261:
---------------------------------------

So now that CASSANDRA-1123 has been resolved, will this get merged in time for 
1.2?
                
> [patch] Support consistency-latency prediction in nodetool
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-4261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4261
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>    Affects Versions: 1.2.0 beta 1
>            Reporter: Peter Bailis
>         Attachments: demo-pbs-v3.sh, pbs-nodetool-v3.patch
>
>
> h3. Introduction
> Cassandra supports a variety of replication configurations: ReplicationFactor 
> is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting 
> {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong 
> consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or 
> {{THREE}}. What should users choose?
> This patch provides a latency-consistency analysis within {{nodetool}}. Users 
> can accurately predict Cassandra's behavior in their production environments 
> without interfering with performance.
> What's the probability that we'll read a write t seconds after it completes? 
> What about reading one of the last k writes? This patch provides answers via 
> {{nodetool predictconsistency}}:
> {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
> \\ \\
> {code:title=Example output|borderStyle=solid}
> //N == ReplicationFactor
> //R == read ConsistencyLevel
> //W == write ConsistencyLevel
> user@test:$ nodetool predictconsistency 3 100 1
> Performing consistency prediction
> 100ms after a given write, with maximum version staleness of k=1
> N=3, R=1, W=1
> Probability of consistent reads: 0.678900
> Average read latency: 5.377900ms (99.900th %ile 40ms)
> Average write latency: 36.971298ms (99.900th %ile 294ms)
> N=3, R=1, W=2
> Probability of consistent reads: 0.791600
> Average read latency: 5.372500ms (99.900th %ile 39ms)
> Average write latency: 303.630890ms (99.900th %ile 357ms)
> N=3, R=1, W=3
> Probability of consistent reads: 1.000000
> Average read latency: 5.426600ms (99.900th %ile 42ms)
> Average write latency: 1382.650879ms (99.900th %ile 629ms)
> N=3, R=2, W=1
> Probability of consistent reads: 0.915800
> Average read latency: 11.091000ms (99.900th %ile 348ms)
> Average write latency: 42.663101ms (99.900th %ile 284ms)
> N=3, R=2, W=2
> Probability of consistent reads: 1.000000
> Average read latency: 10.606800ms (99.900th %ile 263ms)
> Average write latency: 310.117615ms (99.900th %ile 335ms)
> N=3, R=3, W=1
> Probability of consistent reads: 1.000000
> Average read latency: 52.657501ms (99.900th %ile 565ms)
> Average write latency: 39.949799ms (99.900th %ile 237ms)
> {code}
> h3. Demo
> Here's an example scenario you can run using 
> [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:
> {code:borderStyle=solid}
> cd <cassandra-source-dir with patch applied>
> ant
> ccm create consistencytest --cassandra-dir=. 
> ccm populate -n 5
> ccm start
> # if start fails, you might need to initialize more loopback interfaces
> # e.g., sudo ifconfig lo0 alias 127.0.0.2
> # use stress to get some sample latency data
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read
> bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
> {code}
> h3. What and Why
> We've implemented [Probabilistically Bounded 
> Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting 
> consistency-latency trade-offs within Cassandra. Our 
> [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 
> 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range 
> of Dynamo-style data store deployments at places like LinkedIn and Yammer in 
> addition to profiling our own Cassandra deployments. In our experience, 
> prediction is both accurate and much more lightweight than profiling and 
> manually testing each possible replication configuration (especially in 
> production!).
> This analysis is important for the many users we've talked to and heard about 
> who use "partial quorum" operation (e.g., non-{{QUORUM}} 
> {{ConsistencyLevel}}). Should they use CL={{ONE}}? CL={{TWO}}? It likely 
> depends on their runtime environment and, short of profiling in production, 
> there's no existing way to answer these questions. (Keep in mind, Cassandra 
> defaults to CL={{ONE}}, meaning users don't know how stale their data will 
> be.)
> We outline limitations of the current approach after describing how it's 
> done. We believe that this is a useful feature that can provide guidance and 
> fairly accurate estimation for most users.
> h3. Interface
> This patch allows users to perform this prediction in production using 
> {{nodetool}}.
> Users enable tracing of latency data by calling 
> {{enableConsistencyPredictionLogging()}} in the {{PBSPredictorMBean}}.
> Cassandra logs a configurable number of latencies per distribution 
> (controllable via JMX with 
> {{setMaxLoggedLatenciesForConsistencyPrediction(int maxLogged)}}; default: 
> 10000). Each latency is 8 bytes, and there are 4 distributions we require, so 
> the space overhead is {{32*logged_latencies}} bytes of memory for the 
> predicting node (roughly 320 KB at the default of 10000).
> {{nodetool predictconsistency}} predicts the latency and consistency for each 
> possible {{ConsistencyLevel}} setting (reads and writes) by running a 
> configurable number of Monte Carlo trials per configuration (set via 
> {{setNumberTrialsForConsistencyPrediction(int numTrials)}}; default: 10000).
> Users shouldn't have to touch these parameters, and the defaults work well. 
> The more latencies they log, the better the predictions will be.
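> For illustration, here's a minimal sketch of driving these MBean operations 
> from a standalone JMX client. The operation names come from this patch; the 
> JMX port and the {{ObjectName}} the {{PBSPredictorMBean}} is registered under 
> are assumptions and may differ in your deployment.
> {code:title=Hypothetical JMX client sketch|borderStyle=solid}
> import javax.management.MBeanServerConnection;
> import javax.management.ObjectName;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
> 
> public class EnablePbsTracing
> {
>     public static void main(String[] args) throws Exception
>     {
>         // Assumed JMX endpoint and MBean name; adjust both for your node.
>         JMXServiceURL url = new JMXServiceURL(
>             "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
>         try (JMXConnector connector = JMXConnectorFactory.connect(url))
>         {
>             MBeanServerConnection mbs = connector.getMBeanServerConnection();
>             ObjectName pbs =
>                 new ObjectName("org.apache.cassandra.service:type=PBSPredictor");
> 
>             // Start logging latency samples for the four distributions.
>             mbs.invoke(pbs, "enableConsistencyPredictionLogging",
>                        new Object[0], new String[0]);
> 
>             // Optional: raise the per-distribution sample bound (default 10000).
>             mbs.invoke(pbs, "setMaxLoggedLatenciesForConsistencyPrediction",
>                        new Object[] { 20000 }, new String[] { "int" });
>         }
>     }
> }
> {code}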
> h3. Implementation
> This patch is fairly lightweight, requiring minimal changes to existing code. 
> The high-level overview is that we gather empirical latency distributions and 
> then perform Monte Carlo analysis using the gathered data.
> h4. Latency Data
> We log latency data in {{service.PBSPredictor}}, recording four relevant 
> distributions:
>  * *W*: time from when the coordinator sends a mutation to the time that a 
> replica begins to serve the new value(s)
>  * *A*: time from when a replica accepting a mutation sends an acknowledgment 
> to the time that the coordinator receives it
>  * *R*: time from when the coordinator sends a read request to the time that 
> the replica performs the read
>  * *S*: time from when the replica sends a read response to the time when the 
> coordinator receives it
> We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along 
> with every message (8 bytes overhead required for millisecond {{long}}). In 
> {{net.MessagingService}}, we log the start of every mutation and read, and, 
> in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. 
> Jonathan Ellis mentioned that 
> [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar 
> latency tracing, but, as far as we can tell, these latencies aren't in that 
> patch. We use an LRU policy to bound the number of latencies we track for 
> each distribution.
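> As a rough illustration of that bounding (a sketch only, not the patch's 
> actual data structure), each of the four distributions can be kept in a 
> fixed-size sample log that evicts the oldest entry once the {{maxLogged}} 
> bound is reached, which is what an LRU policy amounts to for append-only 
> samples:
> {code:title=Sketch of a bounded per-distribution latency log|borderStyle=solid}
> import java.util.ArrayDeque;
> import java.util.Deque;
> 
> final class BoundedLatencyLog
> {
>     private final int maxLogged;   // cf. setMaxLoggedLatenciesForConsistencyPrediction
>     private final Deque<Long> samples = new ArrayDeque<>();
> 
>     BoundedLatencyLog(int maxLogged)
>     {
>         this.maxLogged = maxLogged;
>     }
> 
>     synchronized void record(long latencyMillis)
>     {
>         if (samples.size() >= maxLogged)
>             samples.removeFirst();   // drop the oldest sample to respect the bound
>         samples.addLast(latencyMillis);
>     }
> 
>     synchronized long[] snapshot()
>     {
>         return samples.stream().mapToLong(Long::longValue).toArray();
>     }
> }
> {code}
> One such log per distribution (W, A, R, S) corresponds to the 
> {{32*logged_latencies}} figure above, although the boxed {{Long}} values in 
> this simplified sketch would cost somewhat more than a primitive array.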
> h4. Prediction
> When prompted by {{nodetool}}, we call {{service.PBSPredictor.doPrediction}}, 
> which performs the actual Monte Carlo analysis based on the provided data. 
> It's straightforward, and we've commented this analysis pretty well but can 
> elaborate more here if required.
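> To make the trial structure concrete, here's a hedged sketch of a single 
> Monte Carlo trial for k=1. It is a simplification of the model from the 
> paper, not the code in {{doPrediction}}; the {{pick}} helper is hypothetical 
> and simply resamples uniformly from the logged W, A, R, and S values.
> {code:title=Simplified single Monte Carlo trial (k=1)|borderStyle=solid}
> import java.util.Arrays;
> import java.util.Random;
> 
> final class PbsTrialSketch
> {
>     // One trial: would a read issued tMillis after the write commits see the new value?
>     static boolean consistentTrial(int n, int w, int r, double tMillis, Random rng,
>                                    double[] wSamples, double[] aSamples,
>                                    double[] rSamples, double[] sSamples)
>     {
>         double[] writeArrive = new double[n];   // W_i: write reaches replica i
>         double[] ackArrive = new double[n];     // W_i + A_i: ack reaches coordinator
>         for (int i = 0; i < n; i++)
>         {
>             writeArrive[i] = pick(wSamples, rng);
>             ackArrive[i] = writeArrive[i] + pick(aSamples, rng);
>         }
>         double[] sortedAcks = ackArrive.clone();
>         Arrays.sort(sortedAcks);
>         double writeLatency = sortedAcks[w - 1];   // write commits after W acks
> 
>         double readStart = writeLatency + tMillis; // read begins t ms after commit
>         final double[] responseArrive = new double[n];
>         boolean[] sawNewValue = new boolean[n];
>         for (int i = 0; i < n; i++)
>         {
>             double readReach = readStart + pick(rSamples, rng);  // request hits replica i
>             sawNewValue[i] = writeArrive[i] <= readReach;        // write already there?
>             responseArrive[i] = readReach + pick(sSamples, rng); // response back
>         }
> 
>         // The coordinator uses the first R responses; consistent if any is fresh.
>         Integer[] order = new Integer[n];
>         for (int i = 0; i < n; i++)
>             order[i] = i;
>         Arrays.sort(order, (a, b) -> Double.compare(responseArrive[a], responseArrive[b]));
>         for (int j = 0; j < r; j++)
>             if (sawNewValue[order[j]])
>                 return true;
>         return false;
>     }
> 
>     private static double pick(double[] samples, Random rng)
>     {
>         return samples[rng.nextInt(samples.length)];
>     }
> }
> {code}
> Averaging this boolean over many trials gives the consistency probability; 
> the sampled ack and response times similarly yield the latency averages and 
> percentiles.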
> h4. Testing
> We've modified the unit test for {{SerializationsTest}} and provided a new 
> unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the 
> {{PBSPredictor}} test with {{ant pbs-test}}.
> h4. Overhead
> This patch introduces 8 bytes of overhead per message. We could reduce this 
> overhead and add timestamps on-demand, but this would require changing 
> {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is 
> messy.
> If enabled, consistency tracing requires {{32*logged_latencies}} bytes of 
> memory on the node on which tracing is enabled.
> h3. Caveats
> The predictions are conservative, or worst-case, meaning we may predict more 
> staleness than actually occurs in practice, in the following ways:
>  * We do not account for read repair. 
>  * We do not account for Merkle tree exchange.
>  * Multi-version staleness is particularly conservative.
> The predictions are optimistic in the following ways:
>  * We do not predict the impact of node failure.
>  * We do not model hinted handoff.
> We simulate non-local reads and writes. We assume that the coordinating 
> Cassandra node is not itself a replica for a given key. (See discussion 
> below.)
> Predictions are only as good as the collected latencies. Generally, the more 
> latencies that are collected, the better, but if the environment or workload 
> changes, previously collected samples may no longer reflect current behavior. 
> Also, we currently don't distinguish between column families or value sizes. 
> This is doable, but it adds complexity to the interface and possibly more 
> storage overhead.
> Finally, for accurate results, we require replicas to have synchronized 
> clocks (Cassandra requires this from clients anyway). If clocks are skewed or 
> out of sync, predictions will be biased by the magnitude of the skew.
> We can potentially improve these if there's interest, but this is an area of 
> active research.
> ----
> Peter Bailis and Shivaram Venkataraman
> [pbai...@cs.berkeley.edu|mailto:pbai...@cs.berkeley.edu]
> [shiva...@cs.berkeley.edu|mailto:shiva...@cs.berkeley.edu]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
