[ 
https://issues.apache.org/jira/browse/CASSANDRA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bailis updated CASSANDRA-4261:
------------------------------------

    Comment: was deleted

(was: Last commit to Cassandra fork for this patch is at 
https://github.com/pbailis/cassandra-pbs/commit/6e0ac68b43a7e6692423abf760edf88d633dd04d)
    
> [Patch] Support consistency-latency prediction in nodetool
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-4261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4261
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>    Affects Versions: 1.2
>            Reporter: Peter Bailis
>         Attachments: pbs-nodetool-v1.patch
>
>
> h2. Introduction
> Cassandra supports a variety of replication configurations: ReplicationFactor 
> is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting 
> ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, 
> but QUORUM is often slower than ONE, TWO, or THREE. What should users choose?
> This patch provides a latency-consistency analysis within nodetool. Users can 
> accurately predict Cassandra behavior in their production environments 
> without interfering with performance.
> What's the probability that we'll read a write t seconds after it completes? 
> What about reading one of the last k writes? nodetool predictconsistency 
> provides this:
> {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
> {code:title=Example output|borderStyle=solid}
> //N == ReplicationFactor
> //R == read ConsistencyLevel
> //W == write ConsistencyLevel
> user@test:$ nodetool predictconsistency 3 100 1
> 100ms after a given write, with maximum version staleness of k=1
> N=3, R=1, W=1
> Probability of consistent reads: 0.811700
> Average read latency: 6.896300ms (99.900th %ile 174ms)
> Average write latency: 8.788000ms (99.900th %ile 252ms)
> N=3, R=1, W=2
> Probability of consistent reads: 0.867200
> Average read latency: 6.818200ms (99.900th %ile 152ms)
> Average write latency: 33.226101ms (99.900th %ile 420ms)
> N=3, R=1, W=3
> Probability of consistent reads: 1.000000
> Average read latency: 6.766800ms (99.900th %ile 111ms)
> Average write latency: 153.764999ms (99.900th %ile 969ms)
> N=3, R=2, W=1
> Probability of consistent reads: 0.951500
> Average read latency: 18.065800ms (99.900th %ile 414ms)
> Average write latency: 8.322600ms (99.900th %ile 232ms)
> N=3, R=2, W=2
> Probability of consistent reads: 0.983000
> Average read latency: 18.009001ms (99.900th %ile 387ms)
> Average write latency: 35.797100ms (99.900th %ile 478ms)
> N=3, R=3, W=1
> Probability of consistent reads: 0.993900
> Average read latency: 101.959702ms (99.900th %ile 1094ms)
> Average write latency: 8.518600ms (99.900th %ile 236ms)
> {code}
> h2. Demo
> Here's an example scenario you can run using 
> [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:
> {code:borderStyle=solid}
> cd <cassandra-source-dir with patch applied>
> ant
> # turn on consistency logging
> sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml
> ccm create consistencytest --cassandra-dir=. 
> ccm populate -n 5
> ccm start
> # if start fails, you might need to initialize more loopback interfaces
> # e.g., sudo ifconfig lo0 alias 127.0.0.2
> # use stress to get some sample latency data
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read
> bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
> {code}
> h2. What and Why
> We've implemented [Probabilistically Bounded 
> Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting 
> consistency-latency trade-offs within Cassandra. Our 
> [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 
> 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range 
> of Dynamo-style data store deployments at places like LinkedIn and Yammer in 
> addition to profiling our own Cassandra deployments. In our experience, 
> prediction is both accurate and much more lightweight than trying out 
> different configurations (especially in production!).
> This analysis is important for the many users we've talked to and heard about 
> who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). 
> Should they use CL=ONE? CL=TWO? It likely depends on their runtime 
> environment and, short of profiling in production, there's no existing way to 
> answer these questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning 
> users don't know how stale their data will be.)
> This patch allows users to perform this prediction in production using 
> {{nodetool}}. Users enable tracing of latency data by setting 
> {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}. 
> Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies 
> (each latency is 8 bytes, and there are 4 distributions we require, so the 
> space overhead is {{32*logged_latencies}} bytes of memory for the predicting 
> node) and then predicts the latency and consistency for each possible 
> ConsistencyLevel setting (reads and writes) by running 
> {{number_trials_for_consistency_prediction}} Monte Carlo trials per 
> configuration. Users shouldn't have to touch these parameters, and the 
> defaults work well. The more latencies they log, the better the predictions 
> will be.
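> For reference, the relevant {{cassandra.yaml}} entries look roughly like 
> this (the numeric values below are illustrative, not necessarily the 
> shipped defaults):
> {code:title=cassandra.yaml (illustrative values)|borderStyle=solid}
> # enable latency logging for consistency prediction
> log_latencies_for_consistency_prediction: true
> # illustrative values; the defaults should work well for most users
> max_logged_latencies_for_consistency_prediction: 10000
> number_trials_for_consistency_prediction: 10000
> {code}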
> We outline limitations of the current approach after describing how it's 
> done. We believe that this is a useful feature that can provide guidance and 
> fairly accurate estimation for most users.
> h2. Implementation
> This patch is fairly lightweight, requiring minimal changes to existing code. 
> At a high level, we gather empirical latency distributions, then run a 
> Monte Carlo analysis over the gathered data.
> h3. Latency Data
> We log latency data in {{service.PBSPredictor}}, recording four relevant 
> distributions:
>  * *W*: time from when the coordinator sends a mutation to the time that a 
> replica begins to serve the new value(s)
>  * *A*: time from when a replica accepting a mutation sends an 
> acknowledgment to the time when the coordinator receives it
>  * *R*: time from when the coordinator sends a read request to the time that 
> the replica performs the read
>  * *S*: time from when the replica sends a read response to the time when the 
> coordinator receives it
> We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along 
> with every message (8 bytes overhead required for millisecond {{long}}). In 
> {{net.MessagingService}}, we log the start of every mutation and read, and, 
> in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. 
> Jonathan Ellis mentioned that 
> [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar 
> latency tracing, but, as far as we can tell, these latencies aren't in that 
> patch.
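> As a rough illustration of this bookkeeping (class, field, and method names 
> below are hypothetical, not the patch's actual identifiers):
> {code:title=Coordinator-side latency logging (sketch)|borderStyle=solid}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical sketch: remember when each mutation/read was sent, then turn
> // the replica's 8-byte millisecond timestamp into a latency sample when the
> // response is handled. The patch itself does this in net.MessagingService
> // and net.ResponseVerbHandler.
> public final class LatencyTraceSketch
> {
>     private final Map<String, Long> sendTimeMillis = new ConcurrentHashMap<>();
>
>     // called when the coordinator sends a mutation or read
>     public void onSend(String messageId)
>     {
>         sendTimeMillis.put(messageId, System.currentTimeMillis());
>     }
>
>     // called when a replica's response arrives; replicaTimestampMillis is the
>     // millisecond timestamp the replica stamped on its message
>     public long latencySampleMillis(String messageId, long replicaTimestampMillis)
>     {
>         Long sentAt = sendTimeMillis.remove(messageId);
>         return replicaTimestampMillis - sentAt;
>     }
> }
> {code}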
> h3. Prediction
> When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, 
> which performs the actual Monte Carlo analysis based on the provided data. 
> It's straightforward, and we've commented this analysis pretty well but can 
> elaborate more here if required.
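> To make the trial structure concrete, here is a rough, hypothetical sketch 
> of what a single trial does with the four logged distributions (names and 
> signatures are illustrative and are not the actual {{PBSPredictor}} code):
> {code:title=One Monte Carlo trial (sketch)|borderStyle=solid}
> import java.util.ArrayList;
> import java.util.Arrays;
> import java.util.Comparator;
> import java.util.List;
> import java.util.Random;
>
> // Hypothetical sketch only; the real analysis lives in
> // service.PBSPredictor.doPrediction.
> public final class PBSTrialSketch
> {
>     private static final Random RNG = new Random();
>
>     // draw one sample uniformly from a logged latency distribution (ms)
>     private static double sample(List<Double> logged)
>     {
>         return logged.get(RNG.nextInt(logged.size()));
>     }
>
>     /**
>      * One trial: does a read issued t ms after a write returns observe that
>      * write? n = ReplicationFactor, w/r = write/read ConsistencyLevel.
>      */
>     public static boolean consistentTrial(int n, int w, int r, double t,
>                                           List<Double> wLat, List<Double> aLat,
>                                           List<Double> rLat, List<Double> sLat)
>     {
>         double[] writeApplied = new double[n]; // W: replica i starts serving the write
>         double[] ackSeen = new double[n];      // W + A: coordinator hears replica i's ack
>         for (int i = 0; i < n; i++)
>         {
>             writeApplied[i] = sample(wLat);
>             ackSeen[i] = writeApplied[i] + sample(aLat);
>         }
>
>         // the write returns once the w-th fastest ack arrives
>         double[] sortedAcks = ackSeen.clone();
>         Arrays.sort(sortedAcks);
>         double writeReturn = sortedAcks[w - 1];
>
>         // the read starts t ms later; replica i serves it R_i ms after that,
>         // and the coordinator sees the reply S_i ms after it was served
>         final double[] readServed = new double[n];
>         final double[] replySeen = new double[n];
>         for (int i = 0; i < n; i++)
>         {
>             readServed[i] = writeReturn + t + sample(rLat);
>             replySeen[i] = readServed[i] + sample(sLat);
>         }
>
>         // the coordinator returns the newest value among the first r replies,
>         // so the read is consistent if any of those replicas had applied the
>         // write before serving the read
>         List<Integer> order = new ArrayList<>();
>         for (int i = 0; i < n; i++)
>             order.add(i);
>         order.sort(Comparator.comparingDouble(i -> replySeen[i]));
>         for (int k = 0; k < r; k++)
>         {
>             int i = order.get(k);
>             if (writeApplied[i] <= readServed[i])
>                 return true;
>         }
>         return false;
>     }
> }
> {code}
> Repeating such trials {{number_trials_for_consistency_prediction}} times and 
> averaging yields the reported probability of consistent reads; the sampled 
> latencies likewise yield the reported latency averages and percentiles.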
> h3. Testing
> We've modified the unit test for {{SerializationsTest}} and provided a new 
> unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the 
> {{PBSPredictor}} test with {{ant pbs-test}}.
> h3. Overhead
> This patch introduces 8 bytes of overhead per message. We could reduce this 
> overhead and add timestamps on-demand, but this would require changing 
> {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is 
> messy.
> If enabled, consistency tracing requires {{32*logged_latencies}} bytes of 
> memory on the node on which tracing is enabled.
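> For a sense of scale: if a node logs, say, 10,000 latencies (an illustrative 
> figure, not necessarily the shipped default), that is 32 * 10,000 = 320,000 
> bytes, or roughly 320 KB, on the predicting node.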
> h2. Caveats
>  The predictions are conservative (worst-case), meaning we may predict more 
> staleness than occurs in practice, in the following ways:
>  * We do not account for read repair. 
>  * We do not account for Merkle tree exchange.
>  * Multi-version staleness is particularly conservative.
>  * We simulate non-local reads and writes. We assume that the coordinating 
> Cassandra node is not itself a replica for a given key.
>  The predictions are optimistic in the following ways:
>  * We do not predict the impact of node failure.
>  * We do not model hinted handoff.
> We can talk about how to improve these if you're interested. This is an area 
> of active research.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
