Re: Which of these VPS configurations would perform better for Cassandra?
@David: Like all other start-ups, we too cannot start with all dedicated servers for Cassandra, so right now we have no better choice than a VPS :), but we can definitely choose one from among a suitable set of VPS configurations. Since we are just starting out, could we initiate our cluster with 2 nodes (RF=2), each with KVM, 2 GB RAM, 2 cores, and a 30 GB SSD? We won't be having a very heavy load on Cassandra for the next few months, until we grow our user base. So this choice is mainly based on pricing vs. configuration, as well as Digital Ocean's good reputation in the community.

On Sun, Aug 4, 2013 at 12:53 AM, David Schairer dschai...@humbaba.net wrote:

I've run several lab configurations on Linodes; I wouldn't run Cassandra on any shared virtual platform for large-scale production, just because your IO performance is going to be really hard to predict. Lots of people do, though -- it depends on your Cassandra loads, how consistent you need performance to be, and how much of your working set will fit into memory. Remember that Linode significantly oversells their CPU as well.

The release version of KVM, at least as of a few months ago, still doesn't support TRIM on SSD; that, plus the fact that you don't know how others will use the SSDs or whether their file systems will keep the SSDs healthy, means that SSD performance on KVM is going to be highly unpredictable. I have not tested DigitalOcean, but I did test several other KVM+SSD shared-tenant hosting providers aggressively for Cassandra a couple of months ago; they all failed badly. Your mileage will vary considerably based on what you need out of Cassandra, what your data patterns look like, and how you configure your system. That said, I would use Xen before KVM for high-performance IO.

I have not run Cassandra in any volume on Amazon -- lots of folks have, and may have recommendations (including SSD) there for where it falls on the price/performance curve.
--DRS

On Aug 3, 2013, at 11:33 AM, Ertio Lew ertio...@gmail.com wrote:

I am building a cluster (initially starting with 2-3 nodes). I have come across two seemingly good options for hosting: Linode and Digital Ocean. The VPS configuration for each is listed below:

Linode:
- Xen virtualization
- 2 GB RAM
- 8-core CPU (2x priority, 8-processor Xen instances)
- 96 GB storage

Digital Ocean:
- KVM virtualization
- 2 GB memory
- 2 cores
- 40 GB SSD disk

Digital Ocean's VPS is at half the price of the Linode VPS listed above. Could you clarify which of these two would be better for Cassandra nodes?
Re: Which of these VPS configurations would perform better for Cassandra?
With 2 GB RAM, be prepared for crashes, because such a node can hardly handle peaks in memory consumption from compaction, validation, etc. KVM works well only if you are using a recent version with virtio drivers and the provider is not overselling memory. On shared hosting you will not be able to handle IO loads during peak times, and handling peak times is the most important thing for any web site. Get two old computers from a second-hand shop, put 6 hard drives and 8 GB RAM in each, and send these servers to a hosting facility. You are trying to save money in the wrong place.
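To see why 2 GB is so tight: Cassandra's cassandra-env.sh derives the JVM heap size from system memory with (roughly) the heuristic sketched below -- the exact formula varies between Cassandra versions, so treat this as an approximation, not the authoritative calculation. On a 2 GB box the heap lands around 1 GB, leaving very little headroom for compaction and validation spikes, and very little memory for the OS page cache.

```python
# Approximate sketch of the heap-sizing heuristic in cassandra-env.sh,
# roughly max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)).
# Exact constants differ between Cassandra versions.

def max_heap_mb(system_memory_mb: int) -> int:
    half = min(system_memory_mb // 2, 1024)      # half of RAM, capped at 1 GB
    quarter = min(system_memory_mb // 4, 8192)   # quarter of RAM, capped at 8 GB
    return max(half, quarter)

for ram in (2048, 8192, 32768):
    print(f"{ram} MB RAM -> ~{max_heap_mb(ram)} MB heap")
# 2 GB RAM yields only ~1 GB of heap; 8 GB RAM yields ~2 GB.
```

The point is not the exact numbers but the shape: a 2 GB node gives Cassandra about half its memory as heap and leaves almost nothing for the page cache, which is exactly where the compaction/validation crashes come from.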
Re: Which of these VPS configurations would perform better for Cassandra?
Okay, so what would be a workable VPS configuration to start with, and what is the minimum number of nodes to start with (is 2 OK)? Seriously, I cannot afford the headaches of a colocation setup. My hosting provider offers SSD drives with KVM virtualization.
Re: Which of these VPS configurations would perform better for Cassandra?
Of course -- my point is simply that if you're looking for speed, SSD+KVM, especially in a shared-tenant situation, is unlikely to perform the way you want. If you're building a pure proof of concept that never stresses the system, it doesn't matter, but if you plan an MVP with any sort of scale, you'll want a plan to be on something more robust.

I'll also say that it's really important (IMHO) to do even your dev in a config with consistency conditions like eventual production -- so make sure you're writing to both nodes and can have cases where eventual-consistency delays kick in, or it'll come back to bite you later. I've seen this force people to redesign their whole data model when they don't plan for it initially.

As I said, I haven't tested DO. I've tested very similar configurations at other providers and they were all terrible under load -- and certainly took away most of the benefits of SSD once you stressed writes a bit. Xen+SSD, on modern kernels, should work better, but I didn't test it (Linode doesn't offer this, though, and they've had lots of other challenges of late).

--DRS
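The eventual-consistency pitfall David warns about can be made concrete with a toy model (illustrative only -- all names here are mine, and real Cassandra read/write coordination is far more involved): a 2-node RF=2 cluster where a write lands on one replica immediately and reaches the other only after replication catches up, so a read at consistency ONE can return stale data.

```python
# Toy model of eventual-consistency delays in a 2-node, RF=2 cluster.
# Purely illustrative; not how Cassandra is actually implemented.

class TinyCluster:
    def __init__(self):
        self.replicas = [{}, {}]   # two replica stores
        self.pending = []          # writes not yet replicated to replica 1

    def write(self, key, value):
        self.replicas[0][key] = value       # lands on one replica immediately
        self.pending.append((key, value))   # the other replica lags behind

    def tick(self):
        for key, value in self.pending:     # replication eventually catches up
            self.replicas[1][key] = value
        self.pending.clear()

    def read_one(self, key, replica):
        # consistency ONE: you get whichever replica the read happens to hit
        return self.replicas[replica].get(key)

cluster = TinyCluster()
cluster.write("user:42", "new-email@example.com")
print(cluster.read_one("user:42", replica=0))  # fresh value
print(cluster.read_one("user:42", replica=1))  # stale: None, write not yet there
cluster.tick()
print(cluster.read_one("user:42", replica=1))  # now consistent
```

If your dev environment never produces the stale-read case (the second read above), you will not discover until production whether your data model tolerates it.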
Re: Which of these VPS configurations would perform better for Cassandra?
A workable configuration depends on your requirements. You need to develop your own testing procedure. Ask yourself:
- How much data will you have?
- What is your 95th-percentile response-time target?
- What is the size of your rows, and the number of columns per row?
- What are your data growth rate and data rewrite rate?
- Is TTL expiration used?
Never aim for the minimum; Cassandra behaves very differently during load spikes.
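One way to turn those questions into a first hardware estimate is a back-of-envelope calculation like the sketch below. Every input is a made-up placeholder to be replaced with your own measurements, and the 2x overhead factor is a rough rule of thumb (size-tiered compaction can transiently need on the order of twice the live data on disk).

```python
# Back-of-envelope Cassandra capacity estimate.
# All inputs are placeholder assumptions -- substitute measured numbers.

rows = 50_000_000          # expected row count
avg_row_bytes = 2_000      # average serialized row size
replication_factor = 2
growth_per_year = 0.5      # 50% data growth over the first year
overhead = 2.0             # headroom for compaction/snapshots (rule of thumb)

live_bytes = rows * avg_row_bytes * replication_factor
year1_bytes = live_bytes * (1 + growth_per_year) * overhead

per_node_gb = year1_bytes / 2 / 1e9   # spread across a 2-node cluster
print(f"cluster disk after 1 year: {year1_bytes / 1e9:.0f} GB")
print(f"per node (2 nodes):        {per_node_gb:.0f} GB")
```

A calculation like this only bounds disk; memory and IO under load spikes are what the previous posters are warning about, and those you can only get from benchmarking.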
Re: org.apache.cassandra.io.sstable.CorruptSSTableException
Re-sending, hoping to get some help. Any ideas would be much appreciated!

From: Keith Wright kwri...@nanigans.com
Date: Friday, August 2, 2013 3:01 PM
To: user@cassandra.apache.org
Subject: org.apache.cassandra.io.sstable.CorruptSSTableException

Hi all,

We just added a node to our cluster (1.2.4, vnodes) and it appears to be running well, except I see that the new node is not making any progress compacting one of the CFs. The exception below is generated. My assumption is that the only way to handle this is to stop the node, delete the file in question, restart, and run repair. Thoughts?

org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: dataSize of 1249463589142530 starting at 5604968 would be larger than file /data/3/cassandra/data/users/global_user/users-global_user-ib-1550-Data.db length 14017479
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:168)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:83)
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:69)
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:177)
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:152)
    at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:139)
    at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:36)
    at org.apache.cassandra.db.compaction.ParallelCompactionIterable$Deserializer$1.runMayThrow(ParallelCompactionIterable.java:288)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: dataSize of 1249463589142530 starting at 5604968 would be larger than file /data/3/cassandra/data/users/global_user/users-global_user-ib-1550-Data.db length 14017479
    at org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:123)
    ... 9 more
Re: Which of these VPS configurations would perform better for Cassandra?
If you want to get a rough idea of how things will perform, fire up YCSB (https://github.com/brianfrankcooper/YCSB/wiki) and run the tests that most closely match what you think your workload will be (run the test clients from a couple of beefy AWS spot instances for less than a dollar). As you are a new startup without any existing load/traffic patterns, benchmarking will be your best bet.

Also, have a look at running Cassandra with SmartOS on Joyent. When you run SmartOS on Joyent, virtualisation is done using Solaris zones, an OS-based virtualisation, which is at least a quadrillion times better than KVM, Xen, etc. Ok, maybe not that much… but it is pretty cool and has the following benefits:
- No hardware emulation.
- Shared kernel with the host (you don't have to waste precious memory running a guest OS).
- ZFS :)

Have a read of http://wiki.smartos.org/display/DOC/SmartOS+Virtualization for more info. There are some downsides as well: the version of Cassandra that comes with the SmartOS package management system is old and busted, so you will want to build from source, and you will want to be technically confident running on something a little outside the norm (SmartOS is based on Solaris).

Just make sure you test and benchmark all your options; a few days of testing now will save you weeks of pain. Good luck!

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr
Better to have lower or greater cardinality for partition key in CQL3?
Hello,

I was curious what people have found to be better for structuring/modeling data in C*. With my data I have two primary keys: one 64-bit int that's 0-50 million (it's unlikely to ever go higher than 70 million), and another 64-bit int that's probably close to hitting a trillion in the next year or so. Looking at how the data is going to behave, for the first few months each row/record will be updated, but after that it's practically written in stone. Still, I was leaning toward leveled compaction, as each row gets updated anywhere from once an hour to at least once a day for the first 7 days.

So from anyone's experience, is it better to use a low-cardinality partition key or a high-cardinality one? Additionally, the data organized by the low-cardinality set is probably 1-6 GB (and growing), but by the high-cardinality set it would be 1-6 MB, only 2-3x a year.

Thanks,
Dave

New high-cardinality keys in 1 year: ~15,768,000,000
New low-cardinality keys in 1 year: 10,000-30,000
Low-cardinality key set size: ~1-6 GB
High-cardinality key set size: 1-5 MB
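As a sanity check on those figures, a quick back-of-envelope helps frame the cardinality trade-off. The ~500-keys-per-second rate below is my inference from the stated yearly total, not a number from the original post, and the partition-size estimate assumes the data is spread evenly across keys.

```python
# Sanity-checking the key-growth figures above. The per-second rate is
# inferred from the yearly total; it was not stated in the original post.

SECONDS_PER_YEAR = 365 * 24 * 3600          # 31,536,000

new_keys_per_year = 15_768_000_000           # high-cardinality keys per year
keys_per_second = new_keys_per_year / SECONDS_PER_YEAR
print(f"implied insert rate: {keys_per_second:.0f} new partition keys/sec")

# Low-cardinality side: 10k-30k keys holding 1-6 GB means individual
# partitions on the order of hundreds of KB each (assuming even spread).
# Wide partitions like this are the usual argument against low-cardinality
# partition keys: hotspots on a few nodes and large compactions, versus
# many tiny, evenly distributed partitions on the high-cardinality side.
low_keys, low_bytes = 30_000, 6e9
print(f"avg low-cardinality partition: {low_bytes / low_keys / 1e3:.0f} KB")
```

Either way, the deciding factors are usually partition width and load distribution rather than the raw key count itself.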