Could ring cache really improve performance in Cassandra?
Hi, I'm doing a stress test on Cassandra. I've learned that using a ring cache can improve performance because client requests can go directly to the target Cassandra server, so the coordinator node is already the desired target node. There is then no need for the coordinator to route requests to the target node, and perhaps we can get a linear performance increase.

However, in my stress test on an Amazon EC2 cluster, the results are strange: there seems to be no performance improvement after using the ring cache. Could anyone help me explain these results? (I also find the results without the ring cache strange, because QPS does not increase linearly when new nodes are added. I need help explaining this, too.) The results are as follows:

INSERT (write):

Node count  Replication factor  QPS (no ring cache)  QPS (ring cache)
1           1                   18687                20195
2           1                   20793                26403
2           2                   22498                21263
4           1                   28348                30010
4           3                   28631                24413

SELECT (read):

Node count  Replication factor  QPS (no ring cache)  QPS (ring cache)
1           1                   24498                22802
2           1                   28219                27030
2           2                   35383                36674
4           1                   34648                28347
4           3                   52932                52590

Thank you very much,
Joy
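For what it's worth, the routing a ring cache (or a token-aware driver) performs can be sketched roughly like this. The node addresses, tokens, and the md5 stand-in hash are illustrative assumptions, not Cassandra's actual implementation (Cassandra's default partitioner uses Murmur3):

```python
import bisect
import hashlib

# Hypothetical sketch of client-side token-aware routing: hash the partition
# key, then pick the node whose token range covers that hash, so the request
# skips the extra coordinator hop. md5 stands in for the real partitioner
# hash purely for illustration.

class RingCache:
    def __init__(self, node_tokens):
        # node_tokens: {node_address: token}, one token per node for simplicity
        self._ring = sorted((t, n) for n, t in node_tokens.items())
        self._tokens = [t for t, _ in self._ring]

    def _token(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Return the node owning the first token >= the key's token."""
        i = bisect.bisect_left(self._tokens, self._token(key))
        return self._ring[i % len(self._ring)][1]  # wrap around the ring

ring = RingCache({"10.0.0.1": 2**125, "10.0.0.2": 2**126, "10.0.0.3": 2**127})
node = ring.node_for("row-42")  # the same key always routes to the same node
```

With such a map cached client-side, each request can open a connection straight to an owning replica instead of an arbitrary coordinator -- which is the effect the test above was trying to measure.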
Re: How to model data to achieve specific data locality
Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly.

-- Then use bucketing to avoid too-wide partitions.

Also, new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operations point of view.

-- I don't understand why altering the table is necessary to add seq_types. If seq_type is defined as your clustering column, you can have many of them using the same table structure ...

On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang dep...@gmail.com wrote:

On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens migh...@gmail.com wrote:

It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.

If the data size per partition exceeds some threshold -- representing the right tradeoff between repair cost, GC pressure, the risk of unbalanced loads, and the other issues that come with wide partitions -- then you can subpartition in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).

For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute subpartition deterministically. Or if you only ever need to read *all* values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you know all possible values for subpartition.
If the values of seq_type for a given seq_id must always be processed in order, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.

On Fri Dec 05 2014 at 7:34:38 PM Kai Wang dep...@gmail.com wrote:

I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytics purposes. Our application processes sequences. Each sequence has a unique key in the format [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally, I would like all the sequences with the same seq_id to co-locate on the same node(s). However, I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:

1. there could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
2. each seq_id might have a different set of seq_types.
3. each application only needs to access a subset of seq_types for a seq_id.

Based on CASSANDRA-5762, selecting a partial row loads the whole row. I prefer to touch only the data that's needed. As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use only part of the field (say 64 bytes) to compute the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe new or upcoming features in C* 3.0?

Thanks.

Thanks, Eric. Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate.
If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also, new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operations point of view.

I thought about your subpartition idea. If there are only a few applications and each of them uses a known subset of seq_types, I can easily create one table per application, since I can compute the subpartition deterministically as you said. But in my case, data scientists need to easily write new applications using any combination of seq_types for a seq_id. So I want the data model to be flexible enough to support applications using any set of seq_types without creating new tables, duplicating all the data, etc.

-Kai
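Eric's contrived bucketing example above (subpartition = seq_type / 100) can be sketched as follows; the helper names and bucket size are made up for illustration:

```python
# Deterministic subpartitioning per Eric's contrived example: an integer
# seq_type divided by a bucket size, so adjacent seq_types share a partition
# under PRIMARY KEY ((seq_id, subpartition), seq_type).
BUCKET_SIZE = 100

def subpartition(seq_type: int) -> int:
    # seq_types 0..99 -> bucket 0, 100..199 -> bucket 1, ...
    return seq_type // BUCKET_SIZE

def partition_key(seq_id: str, seq_type: int) -> tuple:
    return (seq_id, subpartition(seq_type))

# Because the bucket is computed, a known (seq_id, seq_type) pair can be
# located without scanning. Reading everything for a seq_id means querying
# each bucket up to the largest seq_type seen:
def all_buckets(max_seq_type: int) -> range:
    return range(subpartition(max_seq_type) + 1)
```

The tradeoff Kai raises remains: this only works when readers can enumerate the buckets, which is easy here precisely because the subpartition is derived deterministically from seq_type.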
Re: How to model data to achieve specific data locality
Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns.

Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq_type. From a data model perspective, these are just new values in a row. If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq_types, or from defining a schema which includes the set of all possible columns (that's when you get into ALTERs whenever a column comes or goes).

All sequences with the same seq_id tend to grow at the same rate.

Note that it is an anti-pattern in Cassandra to append to the same row indefinitely; I think you understand this, given your original question. But please note that a subpartitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate subpartitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).

I do note that your original question was about preserving data locality -- having a consistent locality for a given seq_id -- for best offline analytics. If you wanted to push for this, you could also include a blob value in your partitioning key whose value is calculated to force a ring collision with the record's sibling data.
With Cassandra's default partitioner, Murmur3Partitioner, that's probably pretty challenging: murmur3 isn't designed to be cryptographically strong (it doesn't try to make collisions hard to force), but it is meant to have good distribution, so it may still be computationally expensive to force a collision -- I'm not that familiar with its internal workings. ByteOrderedPartitioner would make it a lot easier to force a ring collision, but then you need a good ring-balancing strategy to distribute your data evenly over the ring.

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan doanduy...@gmail.com wrote: ...
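Eric's "rotate sub partitions by something that doesn't repeat" advice could look roughly like this; the day-granularity bucket is chosen arbitrarily for illustration (your write volume picks the granularity):

```python
from datetime import datetime, timezone

# Time-bucketed subpartition sketch: each (seq_id, bucket) partition stops
# growing once its day has passed, so its rows settle into a few sstables
# instead of one ever-growing partition scattered across many sstables.

def day_bucket(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d")

def partition_key(seq_id: str, ts: datetime) -> tuple:
    # e.g. PRIMARY KEY ((seq_id, bucket), seq_type) on the CQL side
    return (seq_id, day_bucket(ts))

k1 = partition_key("seq-1", datetime(2014, 12, 7, 9, 0, tzinfo=timezone.utc))
k2 = partition_key("seq-1", datetime(2014, 12, 8, 9, 0, tzinfo=timezone.utc))
# k1 != k2: writes on different days never reopen an old partition
```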
Re: How to model data to achieve specific data locality
It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect the term "sequence" is being overloaded in some subtly misleading way here.

Besides, we've already answered the headline question: data locality is achieved by having a common partition key. So we need some clarity as to what question we are really focusing on.

And, of course, we should be asking the "Cassandra Data Modeling 101" question: what do your queries want to look like -- how exactly do you want to access your data? Only after we have a handle on how you need to read your data can we decide how it should be stored.

My immediate question, to get things back on track: when you say "The typical read is to load a subset of sequences with the same seq_id", what kind of "subset" are you talking about? Again, a few explicit and concise example queries (in some easy-to-read pseudo-language or even plain English, not belabored with full CQL syntax) would be very helpful. Cassandra has no "subset" concept, nor a "load subset" command, so what are we really talking about? Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.

-- Jack Krupansky

From: Eric Stevens
Sent: Sunday, December 7, 2014 10:12 AM
To: user@cassandra.apache.org
Subject: Re: How to model data to achieve specific data locality

...
Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2
Hi Joy,

Are you resetting your data after each test run? I wonder if your tests are causing you to fall behind on data-grooming tasks such as compaction, so that performance suffers in your later tests.

There are *so many* factors which can affect performance that, without reviewing the test methodology in great detail, it's really hard to say whether there are flaws which might uncover an antipattern, cause an atypical number of cache hits or misses, and so forth. You may also be producing GC pressure in the write path.

I *can* say that 28k writes per second looks a little low, but it depends a lot on your network, hardware, and write patterns (e.g., data size). For a little performance-test suite I wrote, with parallel batched writes on a 3-node rf=3 test cluster, I got about 86k writes per second.

Also, focusing exclusively on max latency is going to cause you trouble, especially on magnetic media as you're using. Between ill-timed GC and the inconsistent performance characteristics of magnetic media, your max numbers will often look significantly worse than your p(99) or p(999) numbers.

All this said, one node will often look better than several nodes for certain patterns because it completely eliminates proxy (coordinator) write times -- all writes are local writes. It's an over-simple case that doesn't reflect any practical production use of Cassandra, so it's probably not worth including in your tests. I would recommend starting at 3 nodes rf=3 and comparing against 6 nodes rf=6. Make sure you're staying on top of compaction and aren't seeing garbage collections in the logs (either of those will pollute your results with variability you can't account for with small sample sizes of ~1 million).
If you expect to sustain write volumes like this, you'll find these clusters are sized too small (on that hardware you won't keep up with compaction), and your tests are again testing scenarios you wouldn't actually see in production.

On Sat Dec 06 2014 at 7:09:18 AM kong kongjiali...@gmail.com wrote:

Hi,

I am doing a stress test on DataStax Cassandra Community 2.1.2, not using the provided stress-test tool but my own stress-test client code instead (I wrote some C++ stress-test code). My Cassandra cluster is deployed on Amazon EC2, using the DataStax Community AMI (HVM instances) from the DataStax documentation; I am not using EBS, just the ephemeral storage by default. The EC2 instance type of the Cassandra servers is m3.xlarge. I use another EC2 instance for my stress-test client, of type r3.8xlarge. Both the Cassandra server nodes and the stress-test client node are in us-east.

I test Cassandra clusters of 1 node, 2 nodes, and 4 nodes separately. I run the INSERT test and the SELECT test separately, but performance does not increase linearly when new nodes are added, and I get some weird results. My test results are as follows (I do 1 million operations and try to get the best QPS while the max latency is no more than 200 ms; latencies are measured from the client side.
The QPS is calculated as total_operations/total_time).

INSERT (write):

Node count  RF  QPS    Avg(ms)  Min(ms)  .95(ms)  .99(ms)  .999(ms)  Max(ms)
1           1   18687  2.08     1.48     2.95     5.74     52.8      205.4
2           1   20793  3.15     0.84     7.71     41.35    88.7      232.7
2           2   22498  3.37     0.86     6.04     36.1     221.5     649.3
4           1   28348  4.38     0.85     8.19     64.51    169.4     251.9
4           3   28631  5.22     0.87     18.68    68.35    167.2     288

SELECT (read):

Node count  RF  QPS    Avg(ms)  Min(ms)  .95(ms)  .99(ms)  .999(ms)  Max(ms)
1           1   24498  4.01     1.51     7.6      12.51    31.5      129.6
2           1   28219  3.38     0.85     9.5      17.71    39.2      152.2
2           2   35383  4.06     0.87     9.71     21.25    70.3      215.9
4           1   34648  2.78     0.86     6.07     14.94    30.8      134.6
4           3   52932  3.45     0.86     10.81    21.05    37.4      189.1

The test data I use is generated randomly, and the schema is like this (I use cqlsh to create the column family/table):

CREATE TABLE table( id1 varchar, ts varchar, id2 varchar, msg varchar, PRIMARY KEY(id1, ts, id2));

So the fields are all strings, and I generate each character of each string randomly, using srand(time(0)) and rand() in C++, so I think my test data should be distributed uniformly across the Cassandra cluster. In my client stress-test code I use the Thrift C++ interface, and the basic operations I do are like:

thrift_client.execute_cql3_query("INSERT INTO table WHERE id1=xxx, ts=xxx, id2=xxx, msg=xxx"); and thrift_client.execute_cql3_query("SELECT FROM table WHERE id1=xxx");

Each data entry I INSERT or SELECT is around 100 characters. On my stress test client, I create several threads to send
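For reference, the .95/.99/.999 columns in tables like the one above can be derived from raw client-side latency samples with a nearest-rank percentile. This is a generic sketch with made-up sample data, not Joy's actual test code:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile; p in (0, 1], e.g. 0.99 for the p99 column."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Made-up sample: nine fast requests and one GC-delayed outlier.
samples = [1.2, 0.9, 3.4, 2.2, 150.0, 2.0, 1.8, 2.5, 2.1, 1.7]
p95 = percentile(samples, 0.95)   # the single outlier dominates the tail
worst = percentile(samples, 1.0)  # max latency == p100
# This is why max latency is so much noisier than p99/p999: one ill-timed
# pause is enough to set it.
```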
Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2
I'm sorry -- I meant to say 6 nodes rf=3.

Also, look at performance over sustained periods of time, not burst writing. Run your test for several hours and watch memory and especially compaction stats. See if you can work out what data volume you can write while keeping outstanding compaction tasks below 5 (preferably 0 or 1) for sustained periods. Measuring just burst writes will definitely mask real-world conditions, and Cassandra actually absorbs bursted writes really well (which in turn masks performance problems, since by the time your write times suffer from overwhelming a cluster, you're probably already in an insane and difficult-to-recover crisis mode).

On Sun Dec 07 2014 at 8:55:47 AM Eric Stevens migh...@gmail.com wrote: ...
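Eric's "watch compaction stats" check can be automated roughly like this. The `nodetool compactionstats` output format varies across Cassandra versions, so treat the parsing (and the "pending tasks: N" line it looks for) as an assumption to adapt:

```python
import re
import subprocess

# Poll `nodetool compactionstats` and extract the pending-task count, to
# verify that a sustained-write test keeps outstanding compactions near
# zero rather than silently falling behind.

def parse_pending(stats_text: str):
    """Pull N out of a 'pending tasks: N' line; None if not found."""
    m = re.search(r"pending tasks:\s*(\d+)", stats_text)
    return int(m.group(1)) if m else None

def pending_compactions():
    # Assumes nodetool is on PATH on the node being watched.
    out = subprocess.run(["nodetool", "compactionstats"],
                         capture_output=True, text=True).stdout
    return parse_pending(out)

# e.g. sample parse_pending(...) every minute during the test and flag any
# long stretch where it stays above a handful.
```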
Re: full gc too often
2014-12-05 15:40 GMT+08:00 Jonathan Haddad j...@jonhaddad.com:

I recommend reading through https://issues.apache.org/jira/browse/CASSANDRA-8150 to get an idea of how the JVM GC works and what you can do to tune it. Also good is Blake Eggleston's writeup, which can be found here: http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html I'd like to note that allocating a 4GB heap to Cassandra under any serious workload is unlikely to be sufficient.

Thanks for your recommendation. After reading, I tried allocating a larger heap and it helps; a 4G heap indeed can't handle the workload in my use case. So another question: how much pressure can the default max heap (8G) handle? The pressure is not a simple QPS number -- a slice query over many columns in a row allocates more objects on the heap than a query for a single column. Are there any test results on the relationship between load and a safe heap size? We know that querying a slice with many tombstones is a bad use case, but querying a slice without tombstones should be a common one, right?

On Thu Dec 04 2014 at 8:43:38 PM Philo Yang ud1...@gmail.com wrote:

I have two kinds of machine: 16G RAM with the default heap size setting (about 4G), and 64G RAM with the default heap size setting (about 8G). Both kinds of nodes have the same number of vnodes, and both have the GC issue, although the 16G nodes hit it with higher probability.

Thanks, Philo Yang

2014-12-05 12:34 GMT+08:00 Tim Heckman t...@pagerduty.com:

On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote:

Hi all,

I have a cluster on C* 2.1.1 and JDK 1.7u51. I have a problem with full GC: sometimes one or two nodes run a full GC more than once per minute, taking over 10 seconds each time; the node then becomes unreachable and the latency of the cluster increases.
Grepping the GCInspector log, I found that when a node is running fine, without GC trouble, there are two kinds of GC:

ParNew GC in less than 300ms, which clears Par Eden Space and grows CMS Old Gen / Par Survivor Space a little (because GCInspector only logs GCs longer than 200ms, only a small number of ParNew GCs show up in the log);

ConcurrentMarkSweep in 4000~8000ms, which reduces CMS Old Gen a lot and grows Par Eden Space a little; it runs once every 1-2 hours.

However, sometimes ConcurrentMarkSweep goes strange. It shows:

INFO [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
INFO [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
INFO [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms. CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms. CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms. CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
INFO [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms. CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms. CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
INFO [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms. CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms. CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
INFO [Service Thread] 2014-12-05 11:30:54,085 GCInspector.java:142 - ConcurrentMarkSweep GC in 12082ms. CMS Old Gen: 3579786592 -> 3579814464; Par Eden Space: 503316480 -> 448782440; Par Survivor Space: 62914560 -> 0

Each time, Old Gen shrinks only a little; Survivor Space is cleared, but the heap is still nearly full, so another full GC follows very soon, and then the node goes down. If I restart the node, it runs fine without GC trouble. Can anyone help me find out why full GC can't reduce CMS Old Gen? Is
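The pattern Philo describes -- Old Gen barely moving across back-to-back 10-12 second CMS runs -- can be confirmed mechanically from such GCInspector lines. A rough parser, assuming log lines of the form `CMS Old Gen: <before> -> <after>` (the sample line below is from this thread):

```python
import re

# Pull the pause duration and the CMS Old Gen delta out of a GCInspector
# ConcurrentMarkSweep line. Near-zero (or negative) reclaim across repeated
# multi-second CMS pauses means the old gen is full of live data, so every
# collection is wasted work -- the death spiral described above.

CMS = re.compile(r"ConcurrentMarkSweep GC in (\d+)ms\. "
                 r"CMS Old Gen: (\d+) -> (\d+)")

def cms_reclaim(line: str):
    """Return (pause_ms, bytes_reclaimed) or None for non-CMS lines."""
    m = CMS.search(line)
    if m is None:
        return None
    pause, before, after = map(int, m.groups())
    return pause, before - after

sample = ("INFO [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142"
          " - ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: "
          "3579838424 -> 3579838464; Par Eden Space: ...")
# cms_reclaim(sample): a 12.6s pause that reclaimed -40 bytes, i.e. the
# old gen actually grew during the collection.
```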
Re: Could ring cache really improve performance in Cassandra?
What's a ring cache? FYI, if you're using the DataStax CQL drivers, they will automatically route requests to the correct node.

On Sun Dec 07 2014 at 12:59:36 AM kong kongjiali...@gmail.com wrote: ...
Re: full gc too often
There are a lot of factors that go into tuning, and I don't know of any reliable formula you can use to figure out what will work optimally for your hardware. Personally, I recommend:

1) find the bottleneck
2) play with a parameter (or two)
3) see what changed, performance-wise

If you've got a specific question, I think someone can find a way to help, but asking "what can 8GB of heap give me?" is pretty abstract and unanswerable.

Jon

On Sun Dec 07 2014 at 8:03:53 AM Philo Yang ud1...@gmail.com wrote: ...
Thanks, Philo Yang 2014-12-05 12:34 GMT+08:00 Tim Heckman t...@pagerduty.com: On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote: Hi all, I have a cluster on C* 2.1.1 and jdk 1.7_u51. I'm having trouble with full GC: sometimes one or two nodes run a full GC more than once per minute, over 10 seconds each time, and then the node becomes unreachable and the latency of the cluster increases. Grepping the GCInspector log, I found that when a node is running fine without GC trouble there are two kinds of GC: ParNew GC in less than 300ms, which clears the Par Eden Space and enlarges CMS Old Gen / Par Survivor Space a little (because the log only shows GCs over 200ms, there is only a small number of ParNew GCs in it), and ConcurrentMarkSweep in 4000~8000ms, which reduces CMS Old Gen a lot and enlarges Par Eden Space a little; it is executed once every 1-2 hours. However, sometimes ConcurrentMarkSweep will be strange, like it shows:

INFO [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
INFO [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
INFO [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms. CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms. CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms. CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
INFO [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms. CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms. CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
INFO [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms. CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms. CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
INFO [Service
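For reference, heap sizing in Cassandra 2.1 is controlled in conf/cassandra-env.sh. A minimal sketch of the relevant settings, with illustrative values only (not recommendations for the hardware discussed in this thread):

```shell
# conf/cassandra-env.sh -- explicit heap sizing (Cassandra 2.1 era, CMS collector).
# Both values below are example starting points; tune per workload and RAM.
MAX_HEAP_SIZE="8G"   # total CMS heap; very large heaps lengthen full-GC pauses
HEAP_NEWSIZE="2G"    # young generation size; commonly a fraction of MAX_HEAP_SIZE
```

When both variables are left unset, cassandra-env.sh computes defaults from system memory, which is how the 16G and 64G machines above ended up at roughly 4G and 8G heaps.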
Re: Recommissioned node is much smaller
On Dec 2, 2014 3:45 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Dec 2, 2014 at 12:21 PM, Robert Wille rwi...@fold3.com wrote: As a test, I took down a node, deleted /var/lib/cassandra and restarted it. After it joined the cluster, it’s about 75% the size of its neighbors (both in terms of bytes and numbers of keys). Prior to my test it was approximately the same size. I have no explanation for why that node would shrink so much, other than data loss. I have no deleted data, and no TTLs. Only a small percentage of my data has had any updates (and some of my tables have had only inserts, and those have shrunk by 25% as well). I don’t really know how to check if I have records that have fewer than three replicas (RF=3). Sounds suspicious, actually. I would suspect partial-bootstrap. To determine if you have under-replicated data, run repair. That's what it's for. =Rob
Re: How to model data to achieve specific data locality
Thanks for the help. I wasn't clear on how clustering columns work. Coming from a Thrift background, it took me a while to understand how clustering columns impact partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with some bucketing assumption. If the partition size exceeds the threshold I may need to re-bucket using a smaller bucket size. On another thread Eric mentions the optimal partition size should be around 100 KB ~ 1 MB. I will use that as the starting point to design my bucketing strategy. On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky j...@basetechnology.com wrote: It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect that the term “sequence” is being overloaded in some subtly misleading way here. Besides, we’ve already answered the headline question – data locality is achieved by having a common partition key. So, we need some clarity as to what question we are really focusing on. And, of course, we should be asking the “Cassandra Data Modeling 101” question of what you want your queries to look like, and how exactly you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored. My immediate question to get things back on track: when you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are we really talking about? Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented. 
-- Jack Krupansky *From:* Eric Stevens migh...@gmail.com *Sent:* Sunday, December 7, 2014 10:12 AM *To:* user@cassandra.apache.org *Subject:* Re: How to model data to achieve specific data locality Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq type. From a data model perspective, these are just new values in a row. If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq types. Or defining a schema which includes the set of all possible columns (that's when you're getting into ALTERs when a new column comes or goes). All sequences with the same seq_id tend to grow at the same rate. Note that it is an anti pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate sub partitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume). I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics. If you wanted to work for this, you can certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data. 
With Cassandra's default partitioner of murmur3, that's probably pretty challenging - murmur3 isn't designed to be cryptographically strong (it doesn't work to make it difficult to force a collision), but it's meant to have good distribution (it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work on a good ring balancing strategy to distribute your data evenly over the ring. On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan doanduy...@gmail.com wrote: Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly -- Then use bucketing to avoid too wide partitions Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from operation point of view. -- I don't understand why altering table is necessary to add seq_types. If seq_types is defined as
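The two sub-partitioning strategies discussed in this thread (a deterministic bucket for addressable records, and a non-repeating time bucket to avoid unbounded rows) can be sketched roughly as follows. This is an illustrative sketch; the function names and bucket count are mine, not from the thread:

```python
import hashlib
from datetime import datetime, timezone

def deterministic_bucket(seq_id: str, seq_type: str, num_buckets: int = 16) -> int:
    """Deterministic subpartition: the same (seq_id, seq_type) pair always
    maps to the same bucket, so a known record stays addressable, e.g. for
    PRIMARY KEY ((seq_id, bucket), seq_type)."""
    digest = hashlib.md5(f"{seq_id}:{seq_type}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def time_bucket(ts: datetime) -> str:
    """Non-repeating day bucket: rotating the subpartition over time keeps
    each partition's data grouped into just a few recent sstables instead
    of appending to one row indefinitely."""
    return ts.strftime("%Y-%m-%d")
```

The deterministic variant suits reads of specific seq_id/seq_type pairs; the time variant suits "read everything for a seq_id" workloads where write recency matters more than record addressability.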
Re: How to model data to achieve specific data locality
As a general rule, partitions can certainly be much larger than 1 MB, even up to 100 MB. 5 MB to 10 MB might be a good target size. Originally you stated that the number of seq_types could be “unlimited”... is that really true? Is there no practical upper limit you can establish, like 10,000 or 10 million or...? Sure, buckets are a very real option, but if the number of seq_types was only 10,000 to 50,000, then bucketing might be unnecessary complexity and access overhead. -- Jack Krupansky
Re: How to model data to achieve specific data locality
I think he mentioned 100 MB as the max size - planning for 1 MB might make your data model difficult to work with.
Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2
Hi Eric, Thank you very much for your reply! Do you mean that I should clear my table after each run? Indeed, I can see compaction happen several times during my test, but could just a few compactions affect the performance that much? Also, I can see from OpsCenter that some ParNew GCs happen but no CMS GCs. I run my test on an EC2 cluster, so I think the network within it should be high speed. Each Cassandra server has 4 CPU units, 15 GiB memory and 80 GB SSD storage, which is the m3.xlarge type. As for latency, which latency should I care about most, p(99) or p(999)? I want to get the max QPS under a certain latency limit. I know my testing scenario is not a common case in production; I just want to know how much burden my cluster can bear under stress. So, how did you test your cluster to get 86k writes/sec? How many requests did you send to your cluster? Was it also 1 million? Did you also use OpsCenter to monitor real-time performance? I also wonder why the write and read QPS OpsCenter provides are much lower than what I calculate. Could you please describe your test deployment in detail? Thank you very much, Joy 2014-12-07 23:55 GMT+08:00 Eric Stevens migh...@gmail.com: Hi Joy, Are you resetting your data after each test run? I wonder if your tests are actually causing you to fall behind on data grooming tasks such as compaction, and so performance suffers for your later tests. There are *so many* factors which can affect performance; without reviewing the test methodology in great detail, it's really hard to say whether there are flaws which might uncover an antipattern, cause an atypical number of cache hits or misses, and so forth. You may also be producing GC pressure in the write path. I *can* say that 28k writes per second looks just a little low, but it depends a lot on your network, hardware, and write patterns (e.g., data size). 
For a little performance test suite I wrote, with parallel batched writes on a 3-node rf=3 test cluster, I got about 86k writes per second. Also, focusing exclusively on max latency is going to cause you some trouble, especially in the case of magnetic media as you're using. Between ill-timed GC and inconsistent performance characteristics from magnetic media, your max numbers will often look significantly worse than your p(99) or p(999) numbers. All this said, one node will often look better than several nodes for certain patterns because it completely eliminates proxy (coordinator) write times: all writes are local writes. It's an over-simple case that doesn't reflect any practical production use of Cassandra, so it's probably not worth even including in your tests. I would recommend starting at 3 nodes rf=3, and comparing against 6 nodes rf=6. Make sure you're staying on top of compaction and aren't seeing garbage collections in the logs (either of those will be polluting your results with variability you can't account for with small sample sizes of ~1 million). If you expect to sustain write volumes like this, you'll find these clusters are sized too small (on that hardware you won't keep up with compaction), and your tests are again testing scenarios you wouldn't actually see in production. On Sat Dec 06 2014 at 7:09:18 AM kong kongjiali...@gmail.com wrote: Hi, I am doing a stress test on DataStax Cassandra Community 2.1.2, not using the provided stress test tool but my own stress-test client code instead (I wrote some C++ stress test code). My Cassandra cluster is deployed on Amazon EC2, using the DataStax Community AMI (HVM instances) from the DataStax documentation, and I am not using EBS, just the ephemeral storage by default. The EC2 type of the Cassandra servers is m3.xlarge. I use another EC2 instance for my stress test client, which is of type r3.8xlarge. Both the Cassandra server nodes and the stress test client node are in us-east. 
I test the Cassandra cluster which is made up of 1 node, 2 nodes, and 4 nodes separately. I just do the INSERT test and SELECT test separately, but the performance doesn’t get linear increment when new nodes are added. Also I get some weird results. My test results are as follows (I do 1 million operations and I try to get the best QPS when the max latency is no more than 200ms; the latencies are measured from the client side. The QPS is calculated by total_operations/total_time).

INSERT(write):
Node count  Replication factor  QPS    Avg(ms)  Min(ms)  .95(ms)  .99(ms)  .999(ms)  Max(ms)
1           1                   18687  2.08     1.48     2.95     5.74     52.8      205.4
2           1                   20793  3.15     0.84     7.71     41.35    88.7      232.7
2           2                   22498  3.37     0.86     6.04     36.1     221.5     649.3
4           1                   28348  4.38     0.85     8.19     64.51    169.4     251.9
4           3                   28631  5.22     0.87     18.68    68.35    167.2     288

SELECT(read):
Node count  Replication factor  QPS  Average
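The QPS and percentile figures reported in the test can be computed from raw client-side measurements roughly as follows. This is a sketch of the stated methodology; the helper names are mine, not from the test code:

```python
def qps(total_operations: int, total_time_s: float) -> float:
    """QPS exactly as defined in the test: total_operations / total_time."""
    return total_operations / total_time_s

def percentile(latencies_ms, p: float) -> float:
    """Nearest-rank percentile of client-side latencies, e.g. p=0.99
    gives the value that 99% of requests completed within."""
    s = sorted(latencies_ms)
    idx = min(len(s) - 1, max(0, int(round(p * len(s))) - 1))
    return s[idx]
```

For example, 1 million operations completed in 50 seconds would report 20,000 QPS, regardless of how skewed the individual latencies are; that is why the p(99)/p(999)/max columns carry information the QPS column cannot.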
Re: Could ring cache really improve performance in Cassandra?
I find that under the src/client folder of the Cassandra 2.1.0 source code there is a *RingCache.java* file. It uses a thrift client calling the *describe_ring()* API to get the token range of each Cassandra node. It is used on the client side: the client can use it, combined with the partitioner, to get the target node. In this way there is no need to route requests between Cassandra nodes, and the client can directly connect to the target node. So maybe it can save some routing time and improve performance. Thank you very much. 2014-12-08 1:28 GMT+08:00 Jonathan Haddad j...@jonhaddad.com: What's a ring cache? FYI if you're using the DataStax CQL drivers they will automatically route requests to the correct node.
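The token-to-node lookup that such a ring cache performs can be sketched in a few lines. This is a simplified, single-replica illustration (node names and token values are hypothetical; real Cassandra derives the key token with the Murmur3 partitioner and must also account for replication):

```python
import bisect

def build_ring(node_tokens):
    """node_tokens: {node: token}, as a client might learn from describe_ring().
    Returns a sorted token list plus a parallel list of owning nodes."""
    pairs = sorted((t, n) for n, t in node_tokens.items())
    return [t for t, _ in pairs], [n for _, n in pairs]

def owner(tokens, nodes, key_token):
    """A key belongs to the first node whose token is >= the key's token,
    wrapping around the ring when the key token exceeds every node token."""
    i = bisect.bisect_left(tokens, key_token)
    return nodes[i % len(nodes)]
```

With this map the client opens a connection straight to the owning node, which is the routing the DataStax drivers' token-aware policy performs automatically.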
Can not connect with cqlsh to something different than localhost
I am running Cassandra 2.1.2 in an Ubuntu VM. cqlsh or cqlsh localhost works fine. But I can not connect from outside the VM (firewall, etc. disabled). Even when I do cqlsh 192.168.111.136 in my VM I get connection refused. This is strange because when I check my network config I can see that 192.168.111.136 is my IP:

root@ubuntu:~# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0c:29:02:e0:de
          inet addr:192.168.111.136  Bcast:192.168.111.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe02:e0de/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16042 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8638 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21307125 (21.3 MB)  TX bytes:709471 (709.4 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:550 errors:0 dropped:0 overruns:0 frame:0
          TX packets:550 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:148053 (148.0 KB)  TX bytes:148053 (148.0 KB)

root@ubuntu:~# cqlsh 192.168.111.136 9042
Connection error: ('Unable to connect to any servers', {'192.168.111.136': error(111, "Tried connecting to [('192.168.111.136', 9042)]. Last error: Connection refused")})

What to do?
Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2
I think your client could use improvements. How many threads do you have running in your test? With a thrift call like that you can only do one request at a time per connection. For example, assuming C* itself takes 0ms, a 10ms network latency/driver overhead will mean a 20ms RTT and a max throughput of ~50 QPS per thread (the native binary protocol doesn't behave like this). Are you running the client on its own system or shared with a node? How are you load balancing your requests? Source code would help, since there's a lot that can become a bottleneck. Generally you will see a bit of a dip in latency going from N=RF=1 to N=2, RF=2, etc., since there are optimizations on the coordinator node when it doesn't need to send the request to the replicas. The impact of the network overhead decreases in significance as the cluster grows. Typically, latency-wise, RF=N=1 is going to be the fastest possible for smaller loads (i.e. when a client cannot fully saturate a single node). The main thing to expect is that latency will plateau and remain fairly constant as load/nodes increase, while throughput potential will increase linearly (empirically at least). You should really attempt it with the native binary protocol + prepared statements; running CQL over thrift is far from optimal. I would recommend using the cassandra-stress tool if you want to stress test Cassandra (and not your code): http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema === Chris Lohfink
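The back-of-envelope ceiling for synchronous one-request-per-connection clients described in this thread (20ms RTT giving ~50 QPS per thread) can be sketched as follows; the helper names are mine:

```python
def max_qps_per_thread(rtt_ms: float) -> float:
    """A synchronous thrift client keeps one request in flight per
    connection, so throughput is capped at one request per round trip."""
    return 1000.0 / rtt_ms

def max_qps(rtt_ms: float, threads: int) -> float:
    """The ceiling scales with the number of concurrent connections/threads,
    until the server itself becomes the bottleneck."""
    return threads * max_qps_per_thread(rtt_ms)
```

This is why a benchmark can report ~28k QPS against a cluster capable of far more: with too few client threads, the test measures the client's concurrency, not the cluster.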
Re: Can not connect with cqlsh to something different than localhost
Try:

$ netstat -lnt

and see which interface port 9042 is listening on. You will likely need to update cassandra.yaml to change the interface. By default, Cassandra is listening on localhost, so your local cqlsh session works. On Sun, 7 Dec 2014 23:44 Richard Snowden richard.t.snow...@gmail.com wrote: I am running Cassandra 2.1.2 in an Ubuntu VM. cqlsh or cqlsh localhost works fine. But I can not connect from outside the VM (firewall, etc. disabled). Even when I do cqlsh 192.168.111.136 in my VM I get connection refused. This is strange because when I check my network config I can see that 192.168.111.136 is my IP:

root@ubuntu:~# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0c:29:02:e0:de
          inet addr:192.168.111.136  Bcast:192.168.111.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe02:e0de/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16042 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8638 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21307125 (21.3 MB)  TX bytes:709471 (709.4 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:550 errors:0 dropped:0 overruns:0 frame:0
          TX packets:550 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:148053 (148.0 KB)  TX bytes:148053 (148.0 KB)

root@ubuntu:~# cqlsh 192.168.111.136 9042
Connection error: ('Unable to connect to any servers', {'192.168.111.136': error(111, "Tried connecting to [('192.168.111.136', 9042)]. Last error: Connection refused")})

What to do?
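If netstat shows port 9042 bound only to 127.0.0.1, the usual fix on Cassandra 2.1 is to bind the native transport to the VM's address in cassandra.yaml and restart the node. A sketch of the relevant settings, using the address from the ifconfig output above (verify the option names against your own file):

```yaml
# cassandra.yaml (Cassandra 2.1): the native transport (port 9042)
# binds to rpc_address. Change it from localhost to the VM's address,
# then restart Cassandra.
rpc_address: 192.168.111.136
# Alternatively bind all interfaces; broadcast_rpc_address is then required:
#   rpc_address: 0.0.0.0
#   broadcast_rpc_address: 192.168.111.136
listen_address: 192.168.111.136   # inter-node (gossip/storage) traffic
```

After the restart, `netstat -lnt` should show 9042 listening on 192.168.111.136 rather than only on localhost.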
re: UPDATE statement is failed
Hi All, Here is a practice for the Cassandra UPDATE statement. Maybe it is not the best, but it is a reference for updating a row at high frequency. The UPDATE failed when the statement was executed more than once on the same row. In the end, I changed the primary key so that Cassandra inserts a new row for each UPDATE, and then I delete all the redundant rows. I also found that an UPDATE statement may fail if it immediately follows a DELETE statement. A SELECT statement is used to check that the last UPDATE executed correctly. Peter

From: 鄢来琼 [mailto:laiqiong@gtafe.com] Sent: December 3, 2014 13:08 To: user@cassandra.apache.org Subject: re: UPDATE statement is failed

The system settings are as follows. Cluster replication: replication = {'class': 'NetworkTopologyStrategy', 'GTA_SZ_DC1': 2}. In total, 5 nodes; the OS of the nodes is Windows. Thanks, Regards, 鄢来琼 / Peter YAN, Staff Software Engineer, A3 Dept., GTA Information Technology Co., Ltd. Mobile: 18620306659 E-Mail: laiqiong@gtafe.com Website: http://www.gtafe.com/

From: 鄢来琼 [mailto:laiqiong@gtafe.com] Sent: December 3, 2014 11:49 To: user@cassandra.apache.org Subject: UPDATE statement is failed

Hi ALL, There is a program that consumes messages from a queue; according to the message, the program READs a row and then UPDATEs it. But sometimes the UPDATE statement fails: the result of the READ statement is still the old content from before the UPDATE. Any suggestions are appreciated. The following are my program and the results.
---READ---

from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

self.interval_data_get_simple = SimpleStatement(
    "SELECT TRADETIME, OPENPRICE, HIGHPRICE, LOWPRICE, CLOSEPRICE, CHANGE, "
    "CHANGERATIO, VOLUME, AMOUNT, SECURITYNAME, SECURITYID FROM {} "
    "WHERE SYMBOL = '{}' AND TRADETIME = '{}';".format(
        self.cassandra_table, symbol,
        interval_trade_time.strftime(u'%Y-%m-%d %H:%M:%S')),
    consistency_level=ConsistencyLevel.ALL)
cur_interval_future = self.cassandra_session.execute_async(self.interval_data_get_simple)

---UPDATE---

from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

data_set_simple = SimpleStatement(
    "UPDATE {} SET OPENPRICE = {}, HIGHPRICE = {}, LOWPRICE = {}, "
    "CLOSEPRICE = {}, VOLUME = {}, AMOUNT = {}, MARKET = {}, SECURITYID = {} "
    "WHERE SYMBOL = '{}' AND TRADETIME = '{}';".format(
        self.cassandra_table, insert_data_list[0], insert_data_list[1],
        insert_data_list[2], insert_data_list[3], insert_data_list[4],
        insert_data_list[5], insert_data_list[6], insert_data_list[7],
        insert_data_list[8], insert_data_list[9]),
    consistency_level=ConsistencyLevel.ALL)
update_future = self.cassandra_session.execute(data_set_simple)

---test result---

#CQL UPDATE statement
UPDATE GTA_HFDCS_SSEL2.SSEL2_TRDMIN01_20141127 SET OPENPRICE = 8.460, HIGHPRICE = 8.460, LOWPRICE = 8.460, CLOSEPRICE = 8.460, VOLUME = 1500, AMOUNT = 12240.000, MARKET = 1, SECURITYID = 20103592 WHERE SYMBOL = '600256' AND TRADETIME = '2014-11-27 10:00:00';
#the result of READ
[Row(tradetime=datetime.datetime(2014, 11, 27, 2, 0), openprice=Decimal('8.460'), highprice=Decimal('8.460'), lowprice=Decimal('8.460'), closeprice=Decimal('8.460'), change=None, changeratio=None, volume=1500, amount=Decimal('12240.000'), securityname=None, securityid=20103592)]
#CQL UPDATE statement
UPDATE GTA_HFDCS_SSEL2.SSEL2_TRDMIN01_20141127 SET OPENPRICE = 8.460, HIGHPRICE = 8.460, LOWPRICE = 8.160, CLOSEPRICE = 8.160, VOLUME = 3500, AMOUNT = 28560.000, MARKET = 1, SECURITYID = 20103592 WHERE SYMBOL = '600256' AND TRADETIME = '2014-11-27 10:00:00';
#the result of READ (unchanged: still the values from the first UPDATE)
[Row(tradetime=datetime.datetime(2014, 11, 27, 2, 0), openprice=Decimal('8.460'), highprice=Decimal('8.460'), lowprice=Decimal('8.460'), closeprice=Decimal('8.460'), change=None, changeratio=None, volume=1500, amount=Decimal('12240.000'), securityname=None, securityid=20103592)]

Thanks, Regards, 鄢来琼 / Peter YAN, Staff Software Engineer, A3 Dept., GTA Information Technology Co., Ltd. Mobile: 18620306659 E-Mail: laiqiong@gtafe.com Website: http://www.gtafe.com/
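The thread never pins down a root cause, but a classic culprit for "my second UPDATE didn't take" is write-timestamp ties: Cassandra reconciles cells by timestamp, and when two mutations carry the same timestamp it breaks the tie by comparing the values, so the earlier update can win. Whether that is what happened here is an assumption; the sketch below is a pure-Python illustration of that reconciliation rule, not driver code:

```python
# Toy model of Cassandra cell reconciliation (last-write-wins):
# the higher timestamp wins; on a timestamp tie, the greater value wins.
def reconcile(cell_a, cell_b):
    """Each cell is (timestamp_micros, value); return the surviving cell."""
    if cell_a[0] != cell_b[0]:
        return cell_a if cell_a[0] > cell_b[0] else cell_b
    # Tie: compare the values and keep the greater one.
    return cell_a if cell_a[1] >= cell_b[1] else cell_b

# Two updates landing in the same microsecond: '8.460' > '8.160',
# so the first write survives and the read looks like the UPDATE failed.
first = (1417600800000000, "8.460")
second = (1417600800000000, "8.160")
print(reconcile(first, second))  # the first cell survives
```

If this is the cause, supplying explicit, strictly increasing timestamps (e.g., `USING TIMESTAMP` in CQL) makes successive updates to the same row deterministic without changing the primary key.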
Re: Could ring cache really improve performance in Cassandra?
I would really not recommend using Thrift for anything at this point, including your load tests. Take a look at CQL; all development is going there, and 2.1 has seen a massive performance boost over 2.0. You may want to try the cassandra-stress tool included in 2.1; it can stress a table you've already built. That way you can rule out any bugs on the client side. If you're going to keep using your own tool, however, it would be helpful if you sent out a link to the repo, since currently we have no way of knowing whether you've got a client-side bug (data model or code) that's limiting your performance.

On Sun Dec 07 2014 at 7:55:16 PM 孔嘉林 kongjiali...@gmail.com wrote: I found that under the src/client folder of the Cassandra 2.1.0 source code there is a *RingCache.java* file. It uses a Thrift client calling the *describe_ring()* API to get the token range of each Cassandra node. It is used on the client side: the client can use it, combined with the partitioner, to find the target node. In this way there is no need to route requests between Cassandra nodes, and the client can connect directly to the target node. So maybe it can save some routing time and improve performance. Thank you very much.

2014-12-08 1:28 GMT+08:00 Jonathan Haddad j...@jonhaddad.com: What's a ring cache? FYI, if you're using the DataStax CQL drivers, they will automatically route requests to the correct node.

On Sun Dec 07 2014 at 12:59:36 AM kong kongjiali...@gmail.com wrote: Hi, I'm doing a stress test on Cassandra. I have learned that using a ring cache can improve performance, because client requests can go directly to the target Cassandra server, so the coordinator node is the desired target node. In this way there is no need for the coordinator to route client requests to the target node, and maybe we can get a linear performance increase. However, in my stress test on an Amazon EC2 cluster, the results are weird: there seems to be no performance improvement after using the ring cache.
Could anyone help me explain these results? (Also, I think the results of the test without the ring cache are weird, because there is no linear increase in QPS when new nodes are added. I need help explaining this, too.) The results are as follows:

INSERT (write):
Node count  Replication factor  QPS (no ring cache)  QPS (ring cache)
1           1                   18687                20195
2           1                   20793                26403
2           2                   22498                21263
4           1                   28348                30010
4           3                   28631                24413

SELECT (read):
Node count  Replication factor  QPS (no ring cache)  QPS (ring cache)
1           1                   24498                22802
2           1                   28219                27030
2           2                   35383                36674
4           1                   34648                28347
4           3                   52932                52590

Thank you very much, Joy
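For what RingCache buys you mechanically, here is a toy ring lookup: given each node's token, the owner of a key is the first node clockwise from the key's token, wrapping around. This is only an illustration of the idea (md5 as a stand-in for Murmur3, invented node names and tokens), not Cassandra's actual implementation:

```python
import bisect
import hashlib

# Toy token ring: (token, node) pairs, sorted by token. Real clusters use
# Murmur3 tokens; md5 here is just a deterministic 128-bit stand-in.
RING = sorted([
    (2**125, "node-a"),
    (2**126, "node-b"),
    (2**127 + 2**125, "node-c"),
])
TOKENS = [t for t, _ in RING]

def token_for(key: str) -> int:
    """Hash a partition key to a 128-bit token (stand-in hash)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def owner(key: str) -> str:
    """First node whose token is >= the key's token, wrapping to the start."""
    i = bisect.bisect_left(TOKENS, token_for(key))
    return RING[i % len(RING)][1]

print(owner("600256"))  # the client can contact this node directly
```

This is also what the DataStax drivers' token-aware routing does for you automatically with CQL, which is Jonathan's point: with a token-aware driver there is no coordinator hop to save, so a hand-rolled ring cache shouldn't yield much.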