Re: disk space and tombstones
DuyHai Doan wrote: it looks like there is a need for a tool to take care of the bucketing switch. But I still can't understand why bucketing should be better than `DELETE row USING TIMESTAMP`. It looks like the only source of truth on this topic is the Cassandra source code.
Re: cassandra-stress with clustering columns?
Are you interested in cassandra-stress in particular? Or in any tool which will allow you to stress test your schema? I believe Apache JMeter + CQL plugin may be useful in the latter case. https://github.com/Mishail/CqlJmeter -M On 8/17/14 12:26, Clint Kelly wrote: Hi all, Is there a way to use the cassandra-stress tool with clustering columns? I am trying to figure out whether an application that I'm running is slow because of my application logic, C* data model, or underlying C* setup (e.g., I need more nodes or to tune some parameters). My application uses tables with several clustering columns and a couple of additional indices and it is running quite slowly under a heavy write load. I think that the problem is my data model (and therefore table layout), but I'd like to confirm by replicating the problem with cassandra-stress. I don't see any option for using clustering columns or secondary indices, but I wanted to check before diving into the code and trying to add this functionality. Thanks! Best regards, Clint
Re: Best way to format a ResultSet / Row ?
Hello, I would try something like this (I have not tested it, no guarantee):

import com.datastax.driver.core.ColumnDefinitions;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.utils.Bytes;

/* ... */
ResultSet result = null; // Put your instance HERE
final StringBuilder builder = new StringBuilder();
for (Row row : result) {
    builder.append("[ ");
    for (ColumnDefinitions.Definition def : row.getColumnDefinitions()) {
        String value = Bytes.toHexString(row.getBytesUnsafe(def.getName()));
        builder.append(def.getName()).append("=").append(value).append(" ");
    }
    builder.append("] ");
}
System.out.println(builder.toString());
/* ... */

But this is probably not very useful, since you only get hex dumps of the raw bytes. You can then test the type of the column (variable 'def') in order to call the best-suited getter of 'row', so that the variable 'value' becomes more readable. Fabrice LARCHER 2014-08-19 3:29 GMT+02:00 Kevin Burton bur...@spinn3r.com: The DataStax java driver has a Row object which has getInt, getLong methods… However, getString only works on string columns. That's probably reasonable… but if I have a raw Row, how the heck do I easily print it? I need a handy way to dump a ResultSet … -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com
[RELEASE CANDIDATE] Apache Cassandra 2.1.0-rc6 released
The Cassandra team is pleased to announce the sixth release candidate for the future Apache Cassandra version 2.1.0. Please note that this is not yet the final 2.1.0 release and as such, it should not be considered for production use. We'd appreciate testing and let us know if you encounter any problem[3,4]. Please make sure to have a look at the change log[1] and release notes[2]. Apache Cassandra 2.1.0-rc6[5] is available as usual from the cassandra website (http://cassandra.apache.org/download/) and a debian package is available using the 21x branch (see http://wiki.apache.org/cassandra/DebianPackaging). Enjoy! [1]: http://goo.gl/MyqArD (CHANGES.txt) [2]: http://goo.gl/7vS47U (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA [4]: user@cassandra.apache.org [5]: http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-2.1.0-rc6
Re: Best way to format a ResultSet / Row ?
This kind of question belongs on the java driver mailing list, not the Cassandra one; please try to use the proper mailing list in the future. On Tue, Aug 19, 2014 at 10:11 AM, Fabrice Larcher fabrice.larc...@level5.fr wrote: But this is probably not very usefull, since you get only prints of bytes. You can then test the type of the column (variable 'def') in order to call the best suited method of 'row', You don't have to test the type, you can just use the deserialize method of the column type. So in Fabrice's example: Object val = def.getType().deserialize(row.getBytesUnsafe(def.getName())); -- Sylvain
Options for expanding Cassandra cluster on AWS
Distinguished Colleagues: Our current Cassandra cluster on AWS looks like this: 3 nodes in N. Virginia, one per zone, RF=3. Each node is a c3.4xlarge with 2x160G SSDs in RAID-0 (~300 GB of SSD on each node). It works great; I find it the optimal configuration for a Cassandra node. But the time is coming soon when I need to expand storage capacity. I have the following options in front of me:
1) Add 3 more c3.4xlarge nodes. This keeps the amount of data on each node reasonable, and all repairs and other tasks can complete in a reasonable amount of time. The downside is that c3.4xlarges are pricey.
2) Add provisioned EBS volumes. These days I can get SSD-backed EBS with up to 4000 IOPS provisioned. I can add those volumes to the data_file_directories list in cassandra.yaml, and I expect Cassandra can deal with that JBOD-style. The upside is that it is much cheaper than option #1 above; the downside is that it is a much slower configuration and repairs can take longer.
I'd appreciate any input on this topic. Thanks in advance, Oleg
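For reference, option #2 would look something like this in cassandra.yaml (the relevant key is data_file_directories; the mount points here are hypothetical):

```yaml
# cassandra.yaml -- JBOD-style layout mixing the ephemeral RAID with an EBS volume
# (mount points are placeholders; adjust to your instance layout)
data_file_directories:
    - /mnt/ephemeral-raid/cassandra/data   # existing 2x160G SSD RAID-0
    - /mnt/ebs-piops/cassandra/data        # provisioned-IOPS EBS volume
```

One caveat: Cassandra balances sstables across these directories by free space but does not tier by speed, so some fraction of reads will land on the slower EBS volume regardless of access pattern.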
Re: Options for expanding Cassandra cluster on AWS
The last guidance I heard from DataStax was to use m2.2xlarge's on AWS and put data on the ephemeral drive… have they changed this guidance? Brian On Tue, Aug 19, 2014 at 9:41 AM, Oleg Dulin oleg.du...@gmail.com wrote: Distinguished Colleagues: Our current Cassandra cluster on AWS looks like this: 3 nodes in N. Virginia, one per zone. RF=3 Each node is a c3.4xlarge with 2x160G SSDs in RAID-0 (~300 Gig SSD on each node). Works great, I find it the most optimal configuration for a Cassandra node. But the time is coming soon when I need to expand storage capacity. I have the following options in front of me: 1) Add 3 more c3.4xlarge nodes. This keeps the amount of data on each node reasonable, and all repairs and other tasks can complete in a reasonable amount of time. The downside is that c3.4xlarge are pricey. 2) Add provisioned EBS volumes. These days I can get SSD-backed EBS with up to 4000 IOPS provisioned. I can add those volumes to data_directories list in Yaml, and I expect Cassandra can deal with that JBOD-style The upside is that it is much cheaper than option #1 above; the downside is that it is a much slower configuration and repairs can take longer. I'd appreciate any input on this topic. Thanks in advance, Oleg -- http://about.me/BrianTarbox
Re: Options for expanding Cassandra cluster on AWS
I’m not sure about Datastax’s official stance but using the SSD backed instances (e.g. i2.2xl, c3.4xl, etc.) greatly outperforms the m2.2xl. Also, since Datastax is pro-SSD, I doubt they would still recommend to stay on magnetic disks. That said, I have benchmarked all the way up to the c3.8xl instances. The most IOPS I could get out of each node was around 4000-5000. This seemed to be because the context switching was preventing Cassandra from stressing the SSD drives to their maximum of 40,000 IOPS. Since the SSD backed EBS volumes offer up to 4000 IOPS, the speed of the disk would not be an issue. You would, however, still be sharing network resources, so without a proper benchmark you would still be rolling the dice. The best bang for the buck I’ve seen is the i2 instances. They offer more ephemeral disk space at less of a cost than the c3, albeit less CPU. We currently use the i2.xlarge and they are working out great. On August 19, 2014 at 10:09:26 AM, Brian Tarbox (briantar...@gmail.com) wrote: The last guidance I heard from DataStax was to use m2.2xlarge's on AWS and put data on the ephemeral drivehave they changed this guidance? Brian On Tue, Aug 19, 2014 at 9:41 AM, Oleg Dulin oleg.du...@gmail.com wrote: Distinguished Colleagues: Our current Cassandra cluster on AWS looks like this: 3 nodes in N. Virginia, one per zone. RF=3 Each node is a c3.4xlarge with 2x160G SSDs in RAID-0 (~300 Gig SSD on each node). Works great, I find it the most optimal configuration for a Cassandra node. But the time is coming soon when I need to expand storage capacity. I have the following options in front of me: 1) Add 3 more c3.4xlarge nodes. This keeps the amount of data on each node reasonable, and all repairs and other tasks can complete in a reasonable amount of time. The downside is that c3.4xlarge are pricey. 2) Add provisioned EBS volumes. These days I can get SSD-backed EBS with up to 4000 IOPS provisioned. 
I can add those volumes to data_directories list in Yaml, and I expect Cassandra can deal with that JBOD-style The upside is that it is much cheaper than option #1 above; the downside is that it is a much slower configuration and repairs can take longer. I'd appreciate any input on this topic. Thanks in advance, Oleg -- http://about.me/BrianTarbox
Re: cassandra-stress with clustering columns?
Hi Mikail, This plugin looks great! I have actually been using JMeter + a custom REST endpoint driving Cassandra. It would be great to compare the results I got from that against the pure JMeter + Cassandra (to evaluate the REST endpoint's performance). Thanks! I'll check this out. Best regards, Clint On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura mikhail.step...@outlook.com wrote: Are you interested in cassandra-stress in particular? Or in any tool which will allow you to stress test your schema? I believe Apache Jmeter + CQL plugin may be useful in the latter case. https://github.com/Mishail/CqlJmeter -M On 8/17/14 12:26, Clint Kelly wrote: Hi all, Is there a way to use the cassandra-stress tool with clustering columns? I am trying to figure out whether an application that I'm running on is slow because of my application logic, C* data model, or underlying C* setup (e.g., I need more nodes or to tune some parameters). My application uses tables with several clustering columns and a couple of additional indices and it is running quite slowly under a heavy write load. I think that the problem is my data model (and therefore table layout), but I'd like to confirm by replicating the problem with cassandra-stress. I don't see any option for using clustering columns or secondary indices, but I wanted to check before diving into the code and trying to add this functionality. Thanks! Best regards, Clint
Re: cassandra-stress with clustering columns?
The stress tool in 2.1 also now supports clustering columns: http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema There are however some features up for revision before release in order to help generate realistic workloads. See https://issues.apache.org/jira/browse/CASSANDRA-7519 for details. On Tue, Aug 19, 2014 at 10:46 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi Mikail, This plugin looks great! I have actually been using JMeter + a custom REST endpoint driving Cassandra. It would be great to compare the results I got from that against the pure JMeter + Cassandra (to evaluate the REST endpoint's performance). Thanks! I'll check this out. Best regards, Clint On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura mikhail.step...@outlook.com wrote: Are you interested in cassandra-stress in particular? Or in any tool which will allow you to stress test your schema? I believe Apache Jmeter + CQL plugin may be useful in the latter case. https://github.com/Mishail/CqlJmeter -M On 8/17/14 12:26, Clint Kelly wrote: Hi all, Is there a way to use the cassandra-stress tool with clustering columns? I am trying to figure out whether an application that I'm running on is slow because of my application logic, C* data model, or underlying C* setup (e.g., I need more nodes or to tune some parameters). My application uses tables with several clustering columns and a couple of additional indices and it is running quite slowly under a heavy write load. I think that the problem is my data model (and therefore table layout), but I'd like to confirm by replicating the problem with cassandra-stress. I don't see any option for using clustering columns or secondary indices, but I wanted to check before diving into the code and trying to add this functionality. Thanks! Best regards, Clint
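For anyone landing on this thread later: a 2.1 stress profile is just a YAML file. A rough sketch with one clustering column (the keyspace, table, column names, and distributions here are all made up; see the blog post above for the authoritative syntax):

```yaml
# stress-profile.yaml -- sketch of a cassandra-stress 2.1 user profile
# (names and distributions are hypothetical)
keyspace: stress_ks
keyspace_definition: |
  CREATE KEYSPACE stress_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
table: events
table_definition: |
  CREATE TABLE events (
    id      text,
    ts      timestamp,
    payload blob,
    PRIMARY KEY (id, ts)      -- ts is the clustering column
  );
columnspec:
  - name: id
    size: uniform(8..32)
  - name: ts
    cluster: uniform(1..1000)  # rows per partition
insert:
  partitions: fixed(1)
queries:
  read1:
    cql: SELECT * FROM events WHERE id = ? LIMIT 10
```

It would then be run with something along the lines of `cassandra-stress user profile=stress-profile.yaml ops(insert=1,read1=1)`.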
EC2 SSD cluster costs
The latest consensus around the web for running Cassandra on EC2 seems to be "use the new SSD instances." I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances):
m1.xlarge (2x420 non-SSD): $30,000 (120 nodes)
m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option)
i2.xlarge (1x800 SSD): $76,000 (125 nodes)
Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on previous generation until prices come down?
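The arithmetic behind the comparison is easy to sanity-check mechanically. A small sketch (per-node capacities are taken from the figures in the post; the hourly rates are approximate 2014 on-demand prices and should be re-checked against current AWS pricing):

```java
/** Back-of-the-envelope check of the per-TB cost figures in this thread. */
public class ClusterCost {
    /** Nodes needed to hold totalTb at tbPerNode of usable storage per node. */
    static int nodesNeeded(double totalTb, double tbPerNode) {
        return (int) Math.ceil(totalTb / tbPerNode);
    }

    /** Approximate monthly cost, assuming a 720-hour month. */
    static double monthlyCost(int nodes, double hourlyRate) {
        return nodes * hourlyRate * 720.0;
    }

    public static void main(String[] args) {
        double totalTb = 100.0;
        // {label, usable TB per node, approximate $/hour (assumed rates)}
        Object[][] instances = {
            {"m1.xlarge (2x420 HDD)", 0.84, 0.350},
            {"m3.xlarge (2x40 SSD)",  0.08, 0.280},
            {"i2.xlarge (1x800 SSD)", 0.80, 0.853},
        };
        for (Object[] inst : instances) {
            int nodes = nodesNeeded(totalTb, (Double) inst[1]);
            double cost = monthlyCost(nodes, (Double) inst[2]);
            System.out.printf("%-24s %5d nodes  ~$%,.0f/month%n", inst[0], nodes, cost);
        }
    }
}
```

With those assumed rates the node counts match the post exactly and the monthly totals land within a few percent of its $30k / $250k / $76k figures.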
Re: EC2 SSD cluster costs
Short answer, it depends on your use-case. We migrated to i2.xlarge nodes and saw an immediate increase in performance. If you just need plain ole raw disk space and don’t have a performance requirement to meet then the m1 machines would work, or hell even SSD EBS volumes may work for you. The problem we were having is that we couldn’t fill the m1 machines because we needed to add more nodes for performance. Now we have much more power and just the right amount of disk space. Basically saying, these are not apples-to-apples comparisons On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com) wrote: The latest consensus around the web for running Cassandra on EC2 seems to be use new SSD instances. I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances): m1.xlarge (2x420 non-SSD): $30,000 (120 nodes) m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option) i2.xlarge (1x800 SSD): $76,000 (125 nodes) Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on previous generation until prices come down?
Re: EC2 SSD cluster costs
You're pricing it out at $ per GB… that's not the way to look at it. Price it out at $ per IO… Once you price it that way, SSD makes a LOT more sense. Of course, it depends on your workload. If you're just doing writes, and they're all sequential, then cost per IO might not make a lot of sense. We're VERY IO bound… so for us SSD is a no brainer. We were actually all memory before because of this and just finished a big SSD migration … (though on MySQL)… But our Cassandra deploy will be on SSD on Softlayer. It's a no brainer really.. Kevin On Tue, Aug 19, 2014 at 8:56 AM, Jeremy Jongsma jer...@barchart.com wrote: The latest consensus around the web for running Cassandra on EC2 seems to be use new SSD instances. I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances): m1.xlarge (2x420 non-SSD): $30,000 (120 nodes) m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option) i2.xlarge (1x800 SSD): $76,000 (125 nodes) Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on previous generation until prices come down? -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com
Re: cassandra-stress with clustering columns?
Thanks for the update, Benedict. We are still using 2.0.9 unfortunately. :/ I will keep that in mind for when we upgrade. On Tue, Aug 19, 2014 at 10:51 AM, Benedict Elliott Smith belliottsm...@datastax.com wrote: The stress tool in 2.1 also now supports clustering columns: http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema There are however some features up for revision before release in order to help generate realistic workloads. See https://issues.apache.org/jira/browse/CASSANDRA-7519 for details. On Tue, Aug 19, 2014 at 10:46 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi Mikail, This plugin looks great! I have actually been using JMeter + a custom REST endpoint driving Cassandra. It would be great to compare the results I got from that against the pure JMeter + Cassandra (to evaluate the REST endpoint's performance). Thanks! I'll check this out. Best regards, Clint On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura mikhail.step...@outlook.com wrote: Are you interested in cassandra-stress in particular? Or in any tool which will allow you to stress test your schema? I believe Apache Jmeter + CQL plugin may be useful in the latter case. https://github.com/Mishail/CqlJmeter -M On 8/17/14 12:26, Clint Kelly wrote: Hi all, Is there a way to use the cassandra-stress tool with clustering columns? I am trying to figure out whether an application that I'm running on is slow because of my application logic, C* data model, or underlying C* setup (e.g., I need more nodes or to tune some parameters). My application uses tables with several clustering columns and a couple of additional indices and it is running quite slowly under a heavy write load. I think that the problem is my data model (and therefore table layout), but I'd like to confirm by replicating the problem with cassandra-stress. 
I don't see any option for using clustering columns or secondary indices, but I wanted to check before diving into the code and trying to add this functionality. Thanks! Best regards, Clint
Re: Best way to format a ResultSet / Row ?
I agree that it belongs on that mailing list, but it's set up weird... I can't subscribe to it in Google Groups… I am not sure what exactly is wrong with it. I mailed the admins but it hasn't been resolved. On Tue, Aug 19, 2014 at 1:49 AM, Sylvain Lebresne sylv...@datastax.com wrote: This kind of question belong to the java driver mailing list, not the Cassandra one, please try to use the proper mailing list in the future. On Tue, Aug 19, 2014 at 10:11 AM, Fabrice Larcher fabrice.larc...@level5.fr wrote: But this is probably not very usefull, since you get only prints of bytes. You can then test the type of the column (variable 'def') in order to call the best suited method of 'row', You don't have to test the type, you can just the deserialize method of the column type. So in Fabrice's example, Object val = def.getType().deserialize(row.getBytesUnsafe(def.getName())); -- Sylvain -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com
Re: EC2 SSD cluster costs
Again, depends on your use case. But we wanted to keep the data per node below 500gb, and we found raided ssds to be the best bang for the buck for our cluster. I think we moved from the i2 to the c3 because our bottleneck tended to be CPU utilization (from parsing requests). (Disclaimer, we're not cassandra veterans but we're not part of the RF=N=3 club) On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com wrote: Short answer, it depends on your use-case. We migrated to i2.xlarge nodes and saw an immediate increase in performance. If you just need plain ole raw disk space and don’t have a performance requirement to meet then the m1 machines would work, or hell even SSD EBS volumes may work for you. The problem we were having is that we couldn’t fill the m1 machines because we needed to add more nodes for performance. Now we have much more power and just the right amount of disk space. Basically saying, these are not apples-to-apples comparisons On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com) wrote: The latest consensus around the web for running Cassandra on EC2 seems to be use new SSD instances. I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances): m1.xlarge (2x420 non-SSD): $30,000 (120 nodes) m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option) i2.xlarge (1x800 SSD): $76,000 (125 nodes) Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on previous generation until prices come down?
Manually deleting sstables
After we dropped a table, we noticed that the sstables are still there. After searching through the forum history, I noticed that this is known behavior. 1) Is there any negative impact of deleting the sstables off disk and then restarting Cassandra? 2) Are there any other recommended procedures for this? Thanks, Parag
Re: cassandra-stress with clustering columns?
The stress tool will work against any version of Cassandra, it's only released alongside for ease of deployment. You can safely use the tool from pre-release versions. On Tue, Aug 19, 2014 at 11:03 PM, Clint Kelly clint.ke...@gmail.com wrote: Thanks for the update, Benedict. We are still using 2.0.9 unfortunately. :/ I will keep that in mind for when we upgrade. On Tue, Aug 19, 2014 at 10:51 AM, Benedict Elliott Smith belliottsm...@datastax.com wrote: The stress tool in 2.1 also now supports clustering columns: http://www.datastax.com/dev/blog/improved-cassandra-2-1-stress-tool-benchmark-any-schema There are however some features up for revision before release in order to help generate realistic workloads. See https://issues.apache.org/jira/browse/CASSANDRA-7519 for details. On Tue, Aug 19, 2014 at 10:46 PM, Clint Kelly clint.ke...@gmail.com wrote: Hi Mikail, This plugin looks great! I have actually been using JMeter + a custom REST endpoint driving Cassandra. It would be great to compare the results I got from that against the pure JMeter + Cassandra (to evaluate the REST endpoint's performance). Thanks! I'll check this out. Best regards, Clint On Tue, Aug 19, 2014 at 1:38 AM, Mikhail Stepura mikhail.step...@outlook.com wrote: Are you interested in cassandra-stress in particular? Or in any tool which will allow you to stress test your schema? I believe Apache Jmeter + CQL plugin may be useful in the latter case. https://github.com/Mishail/CqlJmeter -M On 8/17/14 12:26, Clint Kelly wrote: Hi all, Is there a way to use the cassandra-stress tool with clustering columns? I am trying to figure out whether an application that I'm running on is slow because of my application logic, C* data model, or underlying C* setup (e.g., I need more nodes or to tune some parameters). My application uses tables with several clustering columns and a couple of additional indices and it is running quite slowly under a heavy write load. 
I think that the problem is my data model (and therefore table layout), but I'd like to confirm by replicating the problem with cassandra-stress. I don't see any option for using clustering columns or secondary indices, but I wanted to check before diving into the code and trying to add this functionality. Thanks! Best regards, Clint
Re: [RELEASE CANDIDATE] Apache Cassandra 2.1.0-rc6 released
That is great news, keep up the great work! Best Regards, Tony Anecito Founder/President, MyUniPortal LLC http://www.myuniportal.com On Tuesday, August 19, 2014 2:38 AM, Sylvain Lebresne sylv...@datastax.com wrote: The Cassandra team is pleased to announce the sixth release candidate for the future Apache Cassandra version 2.1.0. Please note that this is not yet the final 2.1.0 release and as such, it should not be considered for production use. We'd appreciate testing and let us know if you encounter any problem[3,4]. Please make sure to have a look at the change log[1] and release notes[2]. Apache Cassandra 2.1.0-rc6[5] is available as usual from the cassandra website (http://cassandra.apache.org/download/) and a debian package is available using the 21x branch (see http://wiki.apache.org/cassandra/DebianPackaging). Enjoy! [1]: http://goo.gl/MyqArD (CHANGES.txt) [2]: http://goo.gl/7vS47U (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA [4]: user@cassandra.apache.org [5]: http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-2.1.0-rc6
Re: LOCAL_QUORUM without a replica in current data center
Sorry for the spam - but I wanted to double check if anyone had experience with such a scenario. Thanks. On Sun, Aug 17, 2014 at 7:11 PM, Viswanathan Ramachandran vish.ramachand...@gmail.com wrote: Hi, How does LOCAL_QUORUM read/write behave when the data center on which query is executed does not have a replica of the keyspace? Does it result in an error or can it be configured to do LOCAL_QUORUM on the nearest data center (as depicted by the dynamic snitch) which has the replicas ? We are essentially trying to design a Cassandra cluster with a keyspace only in certain regional-hub data centers to keep number of replicas under control. I am curious to know if a cassandra node not in the regional-hub data center can handle LOCAL_QUORUM type operations, or if clients really need to have a connection to the hub data center with the replica to use that consistency level. Thanks Vish
Re: Manually deleting sstables
On Tue, Aug 19, 2014 at 8:59 AM, Parag Patel ppa...@clearpoolgroup.com wrote: After we dropped a table, we noticed that the sstables are still there. After searching through the forum history, I noticed that this is known behavior. Yes, it's providing protection in this case, though many people do not expect this. 1) Is there any negative impact of deleting the sstables off disk and then restarting Cassandra? You don't have to restart Cassandra, and no. 2) Are there any other recommended procedures for this? 0) stop writes to columnfamily 1) TRUNCATE columnfamily; 2) nodetool clearsnapshot # on the snapshot that results 3) DROP columnfamily; =Rob
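Rob's steps translate to something like the following sketch (`mykeyspace`/`mytable` are placeholder names; writes to the table must already be stopped, and older cqlsh versions without `-e` would need the statements run interactively):

```shell
# 1) TRUNCATE flushes and deletes the live sstables, but auto-snapshots them first
cqlsh -e "TRUNCATE mykeyspace.mytable;"

# 2) Remove the snapshot that TRUNCATE just created, reclaiming the disk space
nodetool clearsnapshot mykeyspace

# 3) Now drop the (empty) table; no sstables are left behind on disk
cqlsh -e "DROP TABLE mykeyspace.mytable;"
```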
Re: EC2 SSD cluster costs
Still using good ol' m1.xlarge here + external caching (memcached). Trying to adapt our use case to have different clusters for different use cases so we can leverage SSD at an acceptable cost in some of them. On Tue, Aug 19, 2014 at 1:05 PM, Shane Hansen shanemhan...@gmail.com wrote: Again, depends on your use case. But we wanted to keep the data per node below 500gb, and we found raided ssds to be the best bang for the buck for our cluster. I think we moved to from the i2 to c3 because our bottleneck tended to be CPU utilization (from parsing requests). (Discliamer, we're not cassandra veterans but we're not part of the RF=N=3 club) On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com wrote: Short answer, it depends on your use-case. We migrated to i2.xlarge nodes and saw an immediate increase in performance. If you just need plain ole raw disk space and don’t have a performance requirement to meet then the m1 machines would work, or hell even SSD EBS volumes may work for you. The problem we were having is that we couldn’t fill the m1 machines because we needed to add more nodes for performance. Now we have much more power and just the right amount of disk space. Basically saying, these are not apples-to-apples comparisons On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com) wrote: The latest consensus around the web for running Cassandra on EC2 seems to be use new SSD instances. I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances): m1.xlarge (2x420 non-SSD): $30,000 (120 nodes) m3.xlarge (2x40 SSD): $250,000 (1250 nodes! 
Clearly not an option) i2.xlarge (1x800 SSD): $76,000 (125 nodes) Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on previous generation until prices come down? -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: EC2 SSD cluster costs
I completely agree with others here. It depends on your use case. We were using Hi1.4xlarge boxes and paying huge amount to Amazon, lately our requirements changed and we are not hammering C* as much and our data size has gone down too, so given the new conditions we reserved and migrated to c3.4xlarges to save quite a lot of money. On Aug 19, 2014, at 10:25 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: Still using good ol' m1.xlarge here + external caching (memcached). Trying to adapt our use case to have different clusters for different use cases so we can leverage SSD at an acceptable cost in some of them. On Tue, Aug 19, 2014 at 1:05 PM, Shane Hansen shanemhan...@gmail.com wrote: Again, depends on your use case. But we wanted to keep the data per node below 500gb, and we found raided ssds to be the best bang for the buck for our cluster. I think we moved to from the i2 to c3 because our bottleneck tended to be CPU utilization (from parsing requests). (Discliamer, we're not cassandra veterans but we're not part of the RF=N=3 club) On Tue, Aug 19, 2014 at 10:00 AM, Russell Bradberry rbradbe...@gmail.com wrote: Short answer, it depends on your use-case. We migrated to i2.xlarge nodes and saw an immediate increase in performance. If you just need plain ole raw disk space and don’t have a performance requirement to meet then the m1 machines would work, or hell even SSD EBS volumes may work for you. The problem we were having is that we couldn’t fill the m1 machines because we needed to add more nodes for performance. Now we have much more power and just the right amount of disk space. Basically saying, these are not apples-to-apples comparisons On August 19, 2014 at 11:57:10 AM, Jeremy Jongsma (jer...@barchart.com) wrote: The latest consensus around the web for running Cassandra on EC2 seems to be use new SSD instances. 
I've not seen any mention of the elephant in the room - using the new SSD instances significantly raises the cluster cost per TB. With Cassandra's strength being linear scalability to many terabytes of data, it strikes me as odd that everyone is recommending such a large storage cost hike almost without reservation. Monthly cost comparison for a 100TB cluster (non-reserved instances): m1.xlarge (2x420 non-SSD): $30,000 (120 nodes) m3.xlarge (2x40 SSD): $250,000 (1250 nodes! Clearly not an option) i2.xlarge (1x800 SSD): $76,000 (125 nodes) Best case, the cost goes up 150%. How are others approaching these new instances? Have you migrated and eaten the costs, or are you staying on previous generation until prices come down? -- *Paulo Motta* Chaordic | *Platform* *www.chaordic.com.br http://www.chaordic.com.br/* +55 48 3232.3200
Re: Cassandra Wiki Immutable?
Added, thanks.

On 08/18/2014 06:15 AM, Otis Gospodnetic wrote:

Hi, What is the state of the Cassandra Wiki -- http://wiki.apache.org/cassandra ? I tried to update a few pages, but it looks like the pages are immutable. Do I need to have my wiki username (OtisGospodnetic) added to some ACL? Thanks, Otis

-- Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/
Cassandra Consistency Level
We have a Cassandra cluster in three different datacenters (DC1, DC2, and DC3), with 10 machines in each datacenter. We have a few tables in Cassandra with fewer than 100 records each. What we are seeing: some tables are out of sync between machines in DC3 as compared to DC1 or DC2 when we do select count(*) on them. As an example, we ran select count(*) while connected to one Cassandra machine in the DC3 datacenter and to one Cassandra machine in the DC1 datacenter, and the results were different.

root@machineA:/home/david/apache-cassandra/bin# python cqlsh dc3114.dc3.host.com
Connected to TestCluster at dc3114.dc3.host.com:9160.
[cqlsh 2.3.0 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> use testingkeyspace ;
cqlsh:testingkeyspace> select count(*) from test_metadata ;

 count
-------
    12

cqlsh:testingkeyspace> exit

root@machineA:/home/david/apache-cassandra/bin# python cqlsh dc18b0c.dc1.host.com
Connected to TestCluster at dc18b0c.dc1.host.com:9160.
[cqlsh 2.3.0 | Cassandra 1.2.9 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
Use HELP for help.
cqlsh> use testingkeyspace ;
cqlsh:testingkeyspace> select count(*) from test_metadata ;

 count
-------
    16

What could be the reason for this sync issue? Can anyone shed some light on this? Our Java driver code and DataStax C++ driver code both use these tables with CONSISTENCY LEVEL ONE.
Re: Cassandra Consistency Level
On Tue, Aug 19, 2014 at 4:14 PM, Check Peck comptechge...@gmail.com wrote:

What could be the reason for this sync issue? Can anyone shed some light on this? Since our java driver code and datastax c++ driver code are using these tables with CONSISTENCY LEVEL ONE.

1) write with CL.ONE
2) get success response to client
3) replication times out to DC3 and is queued as a hint
4) SELECT COUNT(*) in DC3

You should be able to observe the storage and delivery of the hints in 3), and 4) should eventually be correct as a result of their delivery.

=Rob
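Rob's four-step scenario can be illustrated with a toy model (an editor's sketch of the assumed semantics, not Cassandra's actual replication code): a CL.ONE write succeeds as soon as any replica acks it, so a replica that misses the write serves stale counts until the queued hint is delivered.

```python
# Toy model of the 4-step scenario: CL.ONE writes succeed even when one
# DC misses the write; the missed writes are queued as hints and delivered later.
replicas = {"dc1": set(), "dc2": set(), "dc3": set()}
hints_for_dc3 = set()

def write_cl_one(key, reachable):
    # succeeds as long as at least one replica acks; unreachable DCs get hints
    for dc in replicas:
        if dc in reachable:
            replicas[dc].add(key)
        else:
            hints_for_dc3.add(key)

# steps 1-3: 16 rows written; 4 of the writes time out to dc3 (hinted)
for i in range(16):
    reachable = ["dc1", "dc2"] if i % 4 == 0 else ["dc1", "dc2", "dc3"]
    write_cl_one(i, reachable)

# step 4: SELECT COUNT(*) at CL.ONE sees different counts per DC
print(len(replicas["dc1"]), len(replicas["dc3"]))  # 16 12

# hint delivery eventually brings dc3 back in sync
replicas["dc3"] |= hints_for_dc3
print(len(replicas["dc3"]))  # 16
```

The 16-vs-12 counts mirror the numbers in the original question, and hint delivery explains why the discrepancy is expected to heal on its own.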
updated num_tokens value while changing replication factor and getting a nodetool repair error
I have one DC that was originally 3 nodes, each set with a single token:

'-9223372036854775808', '-3074457345618258603', '3074457345618258602'

I added two more nodes and ran nodetool move and nodetool cleanup one server at a time with these tokens:

'-9223372036854775808', '-5534023222112865485', '-1844674407370955162', '1844674407370955161', '5534023222112865484'

Everything looked good, so I changed the replication factor for my keyspace from 1 to 2 and started running nodetool repair on each node. The first node ran for a while, then threw an error:

Repair session 8d2a1190-25aa-11e4-8a15-ff681618d551 for range (1844674407370955161,5534023222112865484] failed with error org.apache.cassandra.exceptions.RepairException: [repair #8d2a1190-25aa-11e4-8a15-ff681618d551 on PLAGIARISM/STATS, (1844674407370955161,5534023222112865484]] Validation failed in /###.###.###

Since this was a temporary column family, I just dropped that CF and tried to run nodetool repair again. It gave me the same error with a different CF. I tried nodetool cleanup and then nodetool repair again, and it eventually crashed the node (status: DN). When I restarted Cassandra, it still had the initial_token value of '-9223372036854775808', but the default num_tokens value was 256. When I checked the status, it showed that the node had 256 tokens (while my other 4 nodes still had 1). Luckily, it chose 256 tokens that were in the existing token range for that server, so it had the same "owns" value.

My question is threefold:

1) Is it better to use 256 vnodes and move the other 4 servers to 256 tokens as well, or is it possible (or better) to change the first server back to a single token again? I did see warnings about not using shuffle, and the new method is to create a new DC and move it over, but I don't have space to do this (the current DC is 31 TB).
I'm fine with 256 tokens that are in the original token range, so would it be OK to never run the shuffle and instead add the 256 tokens to each server? If I should change the first server back to 1 token, is it possible to do so without decommissioning, removing all existing data, and restarting? With such a large dataset, I feel that nodetool repair works better when there are more tokens (my theory is that it's working in smaller chunks); is this a good reason to use 256 instead of one?

2) How can I fix the repair exception above?

3) Nodetool repair takes forever to run (5+ days). Is this because I have 1 token per node, or is there a better way to run this? Should I set the start and end keys?

I'm running Cassandra 2.0.2. Any help would be greatly appreciated.

Thanks,
Bryan
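For reference, the single tokens listed above are exactly the evenly spaced Murmur3Partitioner tokens for 3- and 5-node rings, and can be regenerated with a short sketch (an editor's illustration, not an official tool):

```python
# Evenly spaced Murmur3Partitioner tokens for an n-node ring:
# the token range is [-2^63, 2^63), so start at -2^63 and step by 2^64 / n.
def initial_tokens(n):
    step = 2**64 // n
    return [-2**63 + i * step for i in range(n)]

print(initial_tokens(3))
# [-9223372036854775808, -3074457345618258603, 3074457345618258602]
print(initial_tokens(5))
# [-9223372036854775808, -5534023222112865485, -1844674407370955162,
#  1844674407370955161, 5534023222112865484]
```

Both outputs match the token lists quoted in the question, which confirms the manual moves left the ring evenly balanced before the num_tokens mixup.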
Re: Cassandra Consistency Level
Hi,

As you are writing at CL.ONE, and cqlsh by default reads at CL.ONE, there is a probability that you are reading stale data, i.e. the node you contacted for the read may not have the most recent data. If you have a higher consistency requirement, you should look at increasing your consistency level; for a more detailed look at this see: http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html

If you want to continue using CL.ONE, you could look at increasing the read_repair_chance for better consistency: http://www.datastax.com/documentation/cassandra/2.0/cassandra/reference/referenceTableAttributes.html

Just to verify that this is in fact a consistency issue, could you run a nodetool repair on that table and then run the same queries again?

Regards,
Mark

On 20 August 2014 00:14, Check Peck comptechge...@gmail.com wrote:

We have cassandra cluster in three different datacenters (DC1, DC2 and DC3) and we have 10 machines in each datacenter. [...] What could be the reason for this sync issue? Can anyone shed some light on this? Since our java driver code and datastax c++ driver code are using these tables with CONSISTENCY LEVEL ONE.
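As background for why raising the consistency level helps: a read is guaranteed to see the latest successful write when the number of replicas read plus the number written exceeds the replication factor (R + W > RF), because every read set then overlaps every write set. A quick sketch of the rule (an editor's illustration, not from the thread):

```python
# Consistency overlap rule: reads see the latest write when R + W > RF,
# since any R replicas must then intersect the W replicas that took the write.
def quorum(rf):
    # replicas that must respond for a QUORUM operation
    return rf // 2 + 1

def is_strongly_consistent(r, w, rf):
    return r + w > rf

rf = 3
# CL.ONE writes + CL.ONE reads: no overlap guarantee -> stale reads possible
print(is_strongly_consistent(1, 1, rf))                    # False
# QUORUM writes + QUORUM reads: read and write sets always overlap
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # True
```

This is why the CL.ONE reads in the question can return different counts per datacenter until hints or repair catch the lagging replicas up.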