Re: schema change management tools
Not that I know of. I've always been really strict about dumping my schemas (to start) and keeping my changes in migration files. I don't do a ton of schema changes, so I haven't had a need to really automate it. Even with MySQL I never bothered.

Jon

On Thu, Oct 4, 2012 at 6:27 PM, John Sanda john.sa...@gmail.com wrote:

I have been looking to see if there are any schema change management tools for Cassandra. I have not come across any so far. I figured I would check to see if anyone can point me to something before I start trying to implement something on my own. I have used Liquibase (http://www.liquibase.org) for relational databases. Earlier today I tried using it with the cassandra-jdbc driver, but ran into some exceptions due to the SQL generated. I am not looking specifically for something CQL-based; something that uses the Thrift API via CLI scripts, for example, would work as well.

Thanks
- John

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: schema change management tools
Awesome - keep me posted.

Jon

On Thu, Oct 4, 2012 at 6:42 PM, John Sanda john.sa...@gmail.com wrote:

For the project I work on, and for previous projects as well that support multiple upgrade paths, this kind of tooling is a necessity. And I would prefer to avoid duplicating effort if there is already something out there. If not, though, I will be sure to post back to the list with whatever I wind up doing.

On Thu, Oct 4, 2012 at 9:34 PM, Jonathan Haddad j...@jonhaddad.com wrote:

Not that I know of. I've always been really strict about dumping my schemas (to start) and keeping my changes in migration files. I don't do a ton of schema changes, so I haven't had a need to really automate it. Even with MySQL I never bothered.

Jon

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
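As a sketch of the kind of migration tool being discussed: everything below is hypothetical (a real tool would persist the applied set in a Cassandra table and execute CQL or CLI statements through a session), but the core idempotent-apply loop of a Liquibase-style runner looks roughly like this:

```python
# Minimal sketch of a schema migration runner. Migrations are applied in
# order, and the set of already-applied names is tracked so that reruns
# are idempotent. `execute` stands in for a real session/CLI call.

def run_migrations(migrations, applied, execute):
    """Apply each (name, statement) pair not yet in `applied`, in order."""
    for name, statement in migrations:
        if name in applied:
            continue  # skip migrations applied on a previous run
        execute(statement)
        applied.add(name)
    return applied
```

Running it twice with the same `applied` set executes each statement only once, which is the property that makes multiple upgrade paths manageable.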
Re: Migrating data from 2 node cluster to a 3 node cluster
You should run a nodetool repair after you copy the data over. You could also use sstableloader, which would stream the data to the proper nodes.

On Thu, Jul 4, 2013 at 10:03 AM, srmore comom...@gmail.com wrote:

We are planning to move data from a 2 node cluster to a 3 node cluster. We are planning to copy the data (snapshots) from the two existing nodes to two of the new nodes, hoping that Cassandra will sync it to the third node. Will this work? Are there any other commands to run after we are done migrating, like nodetool repair?

Thanks all.

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
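As a sketch of the sstableloader route (hostnames and data paths below are hypothetical; adjust for your cluster):

```shell
# Stream a snapshotted table into the new cluster; sstableloader routes
# each row to whichever new nodes own it under the new ring, so no
# manual rebalancing is needed.
sstableloader -d new-node1,new-node2 /var/lib/cassandra/data/my_keyspace/my_table/

# Afterwards, repair so every replica holds all the data it owns.
nodetool -h new-node1 repair my_keyspace
```

Repeat the repair on each node (or script it as a rolling operation) once the streaming has finished.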
Re: too many open files
Are you using leveled compaction? If so, what do you have the file size set at? If you're using the defaults, you'll have a ton of really small files. I believe Albert Tobey recommended setting the table's sstable_size_in_mb to 256MB to avoid this problem.

On Sun, Jul 14, 2013 at 5:10 PM, Paul Ingalls paulinga...@gmail.com wrote:

I'm running into a problem where instances of my cluster are hitting over 450K open files. Is this normal for a 4 node 1.2.6 cluster with a replication factor of 3 and about 50GB of data on each node? I can push the file descriptor limit up, but I plan on having a much larger load, so I'm wondering if I should be looking at something else.

Let me know if you need more info.

Paul

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
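If the table is on leveled compaction with the small default target size, something like the following raises it (the table name is hypothetical; 256MB follows the recommendation above):

```sql
-- Hypothetical table name. A larger sstable_size_in_mb makes leveled
-- compaction produce far fewer, larger files, easing file descriptor
-- pressure at the cost of coarser compaction granularity.
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 256};
```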
Re: CPU Bound Writes
Everything is written to the commit log. In the case of a crash, Cassandra recovers by replaying the log.

On Sat, Jul 20, 2013 at 9:03 AM, Mohammad Hajjat haj...@purdue.edu wrote:

Patricia, Thanks for the info. So are you saying that the *whole* data is being written to disk in the commit log, not just some sort of summary/digest? I'm writing 10MB objects and I'm noticing high latency (250 milliseconds even with ANY consistency), so I guess that explains my high delays?

Thanks,
Mohammad

On Fri, Jul 19, 2013 at 2:17 PM, Patricia Gorla gorla.patri...@gmail.com wrote:

Kanwar, This is because writes are appends to the commit log, which is stored on disk, not memory. The write is also applied to the memtable (in memory), which is later flushed to an SSTable on disk. So, most of the actions in sending out a write are writes to disk. Also see: http://www.datastax.com/docs/1.2/dml/about_writes

Patricia

On Fri, Jul 19, 2013 at 1:05 PM, Kanwar Sangha kan...@mavenir.com wrote:

"Insert-heavy workloads will actually be CPU-bound in Cassandra before being memory-bound." Can someone explain the internals of why writes are CPU bound?

--
Mohammad Hajjat
Ph.D. Student
Electrical and Computer Engineering
Purdue University
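A toy Python sketch of the write path described above (not Cassandra code, just the shape of the idea): the full mutation is appended to a durable log before the in-memory structure is updated, and crash recovery is simply log replay.

```python
# Toy model of a commit-log-first write path. The commit log is an
# append-only structure standing in for the on-disk log; the memtable
# is the in-memory structure that a crash would wipe out.

class ToyNode:
    def __init__(self):
        self.commit_log = []   # durable, append-only
        self.memtable = {}     # volatile, lost on crash

    def write(self, key, value):
        self.commit_log.append((key, value))  # durable append first
        self.memtable[key] = value            # then the in-memory update

    def recover(self):
        """Rebuild the memtable by replaying the commit log in order."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value
```

Because the append happens before the memtable update, any write acknowledged to a client is recoverable by replay even if the process dies immediately afterwards.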
Re: VM dimensions for running Cassandra and Hadoop
Having just enough RAM to hold the JVM's heap generally isn't a good idea unless you're not planning on doing much with the machine. Any memory not allocated to a process will generally be put to good use serving as page cache. See here: http://en.wikipedia.org/wiki/Page_cache

Jon

On Tue, Jul 30, 2013 at 10:51 PM, Jan Algermissen jan.algermis...@nordsc.com wrote:

Hi, thanks for the helpful replies last week. It looks as if I will deploy Cassandra on a bunch of VMs, and I am now in the process of understanding what the dimensions of the VMs should be. So far, I understand the following:

- I need at least 3 VMs for a minimal Cassandra setup
- I should get another VM to run the Hadoop job controller, or can that run on one of the Cassandra VMs?
- There is no point in giving the Cassandra JVMs more than 8-12 GB heap space because of GC, so it seems going beyond 16GB RAM per VM makes no sense
- Each VM needs two disks, to separate the commit log from data storage
- I must make sure the disks are directly attached, to prevent problems when multiple nodes flush the commit log at the same time
- I'll be having rather few writes and intend to hold most of the data in memory, so spinning disks are fine for the moment

Does that seem reasonable? How should I plan the disk sizes and number of CPU cores? Are there any other configuration mistakes to avoid? Is there online documentation that discusses such VM sizing questions in more detail?

Jan

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: CQL and undefined columns
It's advised that you do not use compact storage, as it's primarily for backwards compatibility. From the CQL docs:

"The first of these options is COMPACT STORAGE. This option is mainly targeted towards backward compatibility with some table definitions created before CQL3. But it also provides a slightly more compact layout of data on disk, though at the price of flexibility and extensibility, and for that reason is not recommended unless needed for backward compatibility."

On Wed, Jul 31, 2013 at 2:54 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

You should also profile what your data looks like on disk before picking a format. It may not be as efficient to use one form or the other due to extra disk overhead.

On Wed, Jul 31, 2013 at 1:32 PM, Jon Ribbens jon-cassan...@unequivocal.co.uk wrote:

On Wed, Jul 31, 2013 at 02:21:52PM +0200, Alain RODRIGUEZ wrote:

I like to point to this article from Sylvain, which is really well written: http://www.datastax.com/dev/blog/thrift-to-cql3

Ah, thank you, it looks like a combination of a multi-column PRIMARY KEY and use of collections may well suffice for what I want. I must admit that I did not find any of this particularly obvious from the CQL documentation. By the way, http://cassandra.apache.org/doc/cql3/CQL.html#createTableStmt says "A table with COMPACT STORAGE must also define at least one clustering key", which seems to contradict definition 2 in the thrift-to-cql3 document you pointed me to.

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
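For reference, a hypothetical COMPACT STORAGE definition: with a clustering column, such a table is limited to exactly one non-key column, which is the inflexibility being traded for the denser on-disk layout:

```sql
-- Hypothetical names. With COMPACT STORAGE and a clustering column (ts),
-- only one non-primary-key column (payload) is allowed, and columns
-- cannot be added later with ALTER TABLE ... ADD.
CREATE TABLE my_keyspace.events (
    id text,
    ts timeuuid,
    payload blob,
    PRIMARY KEY (id, ts)
) WITH COMPACT STORAGE;
```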
Re: Adding my first node to another one...
I recommend you do not add 1.2 nodes to a 1.1 cluster. We tried this and ran into many issues. Specifically, the data will not correctly stream from the 1.1 nodes to the 1.2 node, and it will never bootstrap correctly.

On Thu, Aug 1, 2013 at 2:07 PM, Morgan Segalis msega...@gmail.com wrote:

Hi Arthur, Thank you for your answer. I have read the section "Adding Capacity to an Existing Cluster" prior to posting my question. Actually, I would like Cassandra to choose the token by itself, since I want only some column families to be replicated across the whole cluster, and other column families to stay where they are, no matter the balancing. I do not find anything in the configuration that I should set on the very first (and so far only) node to start the replication. (The configuration of my node A is pretty basic, almost out of the box; I might have changed the name.) How do I make this node know that it will be a seed? My current node A is using Cassandra 1.1.0. Is it compatible if I install a new node with Cassandra 1.2.8, or should I fetch 1.1.0 for node B?

Thank you.
Morgan.

On 1 August 2013 at 20:32, Arthur Zubarev arthur.zuba...@aol.com wrote:

Hi Morgan, Scaling out depends on several factors. The most intricate is perhaps calculating the tokens. The Cassandra version is also important. At this point in time I suggest you read the section "Adding Capacity to an Existing Cluster" at http://www.datastax.com/docs/1.0/operations/cluster_management and come back here with questions and more details.

Regards,
Arthur

-----Original Message-----
From: Morgan Segalis
Sent: Thursday, August 01, 2013 11:24 AM
To: user@cassandra.apache.org
Subject: Adding my first node to another one...

Hi everyone, I'm trying to wrap my head around Cassandra's great ability to expand. I set up my first Cassandra node a while ago; it was working great, and the data wasn't so important back then. Since I had a great experience with Cassandra, I decided to migrate my MySQL data to Cassandra step by step. Now the data is starting to be important, so I would like to create another node and add it. Since I had some issues with my datacenter, I wanted to have a copy (of sensitive data only) in another datacenter. Quite frankly, I'm still a newbie with Cassandra and need your help.

First things first. The already up and running Cassandra node (called A):
- Do I need to change anything in cassandra.yaml to make sure that another node can connect? If yes, should I restart the node (because I would have to warn users about downtime)?
- Since this node should be a seed, the seed list is already set to localhost; is that good enough?

The new node I want to add (called B):
- I know that before starting this node, I should modify the seed list in cassandra.yaml. Is that the only thing I need to do?

It is my first time doing this, so please be gentle ;-)

Thank you all,
Morgan.

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
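On the seed question above: a localhost seed list is not good enough once a second node joins. A cassandra.yaml fragment of the sort needed (the address is a placeholder, not from the thread):

```yaml
# cassandra.yaml fragment -- 10.0.0.1 stands in for node A's reachable IP.
# Both nodes list node A as the seed so node B can discover the cluster;
# listen_address must likewise be a reachable address, not localhost.
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1"
```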
Re: CQL and undefined columns
The CQL docs recommend not using it - I didn't just make that up. :) COMPACT STORAGE imposes the limit that you can't add columns to your tables. For those of us that are heavy CQL users, this limitation is a total deal breaker.

On Mon, Aug 5, 2013 at 10:27 AM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Jul 31, 2013 at 3:10 PM, Jonathan Haddad j...@jonhaddad.com wrote:

It's advised you do not use compact storage, as it's primarily for backwards compatibility.

Many Apache Cassandra experts do not advise against using COMPACT STORAGE. [1] Use CQL3 non-COMPACT STORAGE if you want to, but there are also valid reasons not to use it. Asserting that there is some good reason you should not use COMPACT STORAGE (other than range ghosts?) seems inaccurate. :)

=Rob
[1] http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/legacy_tables

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: CQL and undefined columns
If you expected your CQL3 query to work, then I think you've missed the point of CQL completely. For many of us, adding a query layer which gives us predictable column names, but continues to allow us to utilize wide rows on disk, is a huge benefit. Why would I want to reinvent a system for structured data when the DB can handle it for me? I get a bunch of stuff for free with CQL, which decreases my development time, which is the resource that I happen to be the most bottlenecked on. Feel free to continue to use Thrift's wide row structure, with ad hoc columns. No one is stopping you.

On Mon, Aug 5, 2013 at 1:36 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

"COMPACT STORAGE imposes the limit that you can't add columns to your tables" is absolutely false. If anything, CQL is imposing the limits! Simple to prove. Try something like this:

create table abc (x int);
insert into abc (y) values (5);

and watch CQL reject the insert, saying something to the effect of "y? What's that? Did you mean CQL2 or 1.5, or hamburgers?" Then go to the cassandra-cli and do this:

create column family abd;
set ['abd']['y']='5';
set ['abd']['z']='4';

AND IT WORKS! I noticed the nomenclature starting to spring up around the term "legacy tables", and docs based around what you "can't do" with them. Frankly it makes me nuts, because... this little-known web company named Google produced a white paper about what a ColumnFamily data model could do: http://en.wikipedia.org/wiki/BigTable. Cassandra was built on the BigTable/ColumnFamily data model. There was also this big movement called NoSQL, where people wanted to break free of query languages and rigid schemas.

On Mon, Aug 5, 2013 at 1:56 PM, Jonathan Haddad j...@jonhaddad.com wrote:

The CQL docs recommend not using it - I didn't just make that up. :) COMPACT STORAGE imposes the limit that you can't add columns to your tables. For those of us that are heavy CQL users, this limitation is a total deal breaker.

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: CQL and undefined columns
CQL maps a series of logical rows into a single physical row by transposing multiple rows, based on partition and clustering keys, into slices of a row. The point is to add a loose schema on top of a wide row, which allows you to stop reimplementing common patterns. Yes, you can go in and mess with your tables via the cassandra-cli, but that's not exactly proving me wrong. You've simply removed the constraints of CQL and written data to the table at a lower level that didn't deal with schema enforcement.

On Mon, Aug 5, 2013 at 2:37 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

"Feel free to continue to use thrift's wide row structure, with ad hoc columns. No one is stopping you."

Thanks. I was not trying to stop you from doing it your way either. You said this: "COMPACT STORAGE imposes the limit that you can't add columns to your tables." I was demonstrating that you are incorrect. I then went on to point out that Cassandra is a ColumnFamily data store which was designed around BigTable. You could always add columns dynamically, because being schema-less is one of the key components of a ColumnFamily datastore. I know which CQL document you are loosely referencing that implies you cannot add columns to compact storage. If that were true, Cassandra would never have been a ColumnFamily data store. I have found several documents championing CQL and its constructs which suggest that some things cannot be done with compact storage. In reality those are shortcomings of the CQL language. I say this because the language cannot easily accommodate the original schema system. Many applications that are already written and performing well do NOT fit well into the CQL model of non-compact storage (which does not have a name, by the way, probably because the opposite of compact is sparse, and how would SPARSE STORAGE sound?). Implying all the original stuff is legacy and you should probably avoid it is wrong. In many cases compact storage is the best way to store things, because it is the smallest.

On Mon, Aug 5, 2013 at 4:57 PM, Jonathan Haddad j...@jonhaddad.com wrote:

If you expected your CQL3 query to work, then I think you've missed the point of CQL completely. For many of us, adding a query layer which gives us predictable column names, but continues to allow us to utilize wide rows on disk, is a huge benefit. Feel free to continue to use Thrift's wide row structure, with ad hoc columns. No one is stopping you.

--
Jon Haddad
http://www.rustyrazorblade.com
Re: Issue with CQLsh
My understanding is that if you want to use CQL, you should create your tables via CQL. Mixing Thrift calls with CQL seems like it's just asking for problems like this.

On Sun, Aug 25, 2013 at 6:53 PM, Vivek Mishra mishra.v...@gmail.com wrote:

cassandra 1.2.4

On Mon, Aug 26, 2013 at 2:51 AM, Nate McCall n...@thelastpickle.com wrote:

What version of cassandra are you using?

On Sun, Aug 25, 2013 at 8:34 AM, Vivek Mishra mishra.v...@gmail.com wrote:

Hi, I have created a column family using cassandra-cli as:

create column family default;

and then inserted a record as:

set default[1]['type']='bytes';

Then I tried to alter the table via cqlsh as:

alter table default alter key type text;     // it works
alter table default alter column1 type text; // it goes for a toss

Surprisingly, any command after that simply hangs and I need to reset the connection. Any suggestions?

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: Cluster Management
An alternative to cssh is fabric. It's very flexible, in that you can automate almost any repetitive task that you'd send to machines in a cluster, and it's written in Python, meaning if you're in AWS you can mix it with boto to automate pretty much anything you want.

On Thu, Aug 29, 2013 at 4:25 PM, Anthony Grasso anthony.gra...@gmail.com wrote:

Hi Patricia, Thank you for the feedback. It has been helpful.

On Tue, Aug 27, 2013 at 12:02 AM, Patricia Gorla gorla.patri...@gmail.com wrote:

Anthony, We use a number of tools to manage our Cassandra cluster:

* Datastax OpsCenter [0] for at-a-glance information and trending statistics. You can also run operations through here, though I prefer to use nodetool for any mutative operation.
* nodetool for ad hoc status checks and day-to-day node management.
* puppet for setup and initialization.

"For example, if I want to make some changes to the configuration file that resides on each node, is there a tool that will propagate the change to each node?"

For this, we use puppet to manage any changes to the configurations (which are stored in git). We initially had Cassandra auto-restart when the configuration changed, but you might not want the node to automatically join a cluster, so we turned this off.

Puppet was the first thing that came to mind for us as well. In addition, we had the same thought about auto-restarting nodes when the configuration is changed. If a configuration on all the nodes is changed, we would want to restart one node at a time and wait for it to rejoin before restarting the next one. I am assuming in a case like this, you then manually perform the restart operation for each node?

"Another example is if I want to have a rolling repair (nodetool repair -pr) and clean up running on my cluster, is there a tool that will help manage/configure that?"

Multiple commands to the cluster are sent via clusterssh [1] (cssh for OS X). I can easily choose which nodes to control, and run those in sync. For any rolling procedures, we send commands one at a time, though we've considered sending some of these tasks to cron.

Thanks again for the tip! This is quite interesting; it may help to solve our immediate problem for now.

Regards,
Anthony

Hope this helps.

Cheers,
Patricia
[0] http://planetcassandra.org/Download/DataStaxCommunityEdition
[1] http://sourceforge.net/projects/clusterssh/

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
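The rolling one-node-at-a-time procedure described above can be sketched generically in Python; `run_cmd` and `is_up` are hypothetical stand-ins for something like fabric's run() and a nodetool-status health check:

```python
# Generic rolling-operation sketch: run a command on one node at a time,
# and wait for that node to report healthy before moving to the next.

def rolling(hosts, command, run_cmd, is_up):
    """Apply `command` host by host, gating on a health check."""
    done = []
    for host in hosts:
        run_cmd(host, command)
        for _ in range(60):            # poll until the node rejoins
            if is_up(host):
                break
        else:
            raise RuntimeError("%s did not come back up" % host)
        done.append(host)
    return done
```

In a real deployment the poll loop would sleep between checks; it is elided here to keep the sketch self-contained.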
Re: Low Row Cache Request
9/12 = 0.75. It's a rate, not a percentage.

On Sat, Aug 31, 2013 at 2:21 PM, Sávio Teles savio.te...@lupa.inf.ufg.br wrote:

I'm running one Cassandra node (version 1.2.6) and I enabled the row cache with 1GB. But looking at the Cassandra metrics in JConsole, Row Cache Requests are very low after a high number of queries (about 12 requests). RowCache metrics:

Capacity: 1GB
Entries: 3
HitRate: 0.75
Hits: 9
Requests: 12
Size: 191630

Something wrong?

--
Atenciosamente,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Mestrando em Ciências da Computação - UFG
Arquiteto de Software
Laboratory for Ubiquitous and Pervasive Applications (LUPA) - UFG
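The arithmetic behind the reply, using the JMX values quoted above:

```python
# HitRate as reported via JMX is hits divided by requests: a fraction
# in [0, 1], not a percentage.
hits, requests = 9, 12
hit_rate = hits / requests
print(hit_rate)  # 0.75
```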
Re: Why don't you start off with a “single small” Cassandra server as you usually do it with MySQL?
For future reference, a blog post on this topic: http://rustyrazorblade.com/2013/09/cassandra-faq-can-i-start-with-a-single-node/

On Wed, Sep 18, 2013 at 6:38 AM, Michał Michalski mich...@opera.com wrote:

You might be interested in this: http://mail-archives.apache.org/mod_mbox/cassandra-user/201308.mbox/%3ccaeqobhpav25pcgjfwbkmd1rzxvrif94e6lpybpj3mu_bqn9...@mail.gmail.com%3E

M.

On 18.09.2013 15:34, Ertio Lew wrote:

For any website just starting out, the load is minimal initially and grows at a slow pace. People usually start their MySQL-based sites with a single server (and a VPS at that, not a dedicated server) running as both the app server and the DB server, and usually get quite far with this setup; only as they feel the need do they separate the DB from the app server, giving it a separate VPS. This is how a startup expects things to be while planning resource procurement. But so far, what I have seen with Cassandra is something very different. People usually recommend starting out with at least a 3 node cluster, on dedicated servers, with lots and lots of RAM: 4GB or 8GB is what they suggest to start with. So is it that Cassandra requires more hardware resources than MySQL for a website to deliver similar performance and serve a similar load/traffic and the same amount of data? I understand the higher storage requirements of Cassandra due to replication, but what about other hardware resources? Can't we start off with Cassandra-based apps just like MySQL, starting with 1 or 2 VPSes and adding more whenever there's a need? I don't want to compare apples with oranges; I just want to know how much more dangerous a situation I may be in when I start out with a single-node VPS-based Cassandra installation vs a single-node VPS-based MySQL installation, and the difference between these two situations. Are Cassandra servers more prone to being unavailable than MySQL servers? Is it bad if I put Tomcat alongside Cassandra, as people use the LAMP stack on a single server?

This question is also posted at StackOverflow (http://stackoverflow.com/questions/18462530/why-dont-you-start-off-with-a-single-small-cassandra-server-as-you-usually) and has an open bounty worth +50 rep.

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: Choosing python client lib for Cassandra
We're currently using the cql package, which is really a wrapper around Thrift. As for your concern about deadlines, I'm not sure how writing raw CQL is going to be any faster than using a mapper library for anything other than the most trivial of projects.

On Tue, Nov 26, 2013 at 11:09 AM, Kumar Ranjan winnerd...@gmail.com wrote:

Jon - Thanks. As I understand, cqlengine is an object mapper and must be using CQL prepared statements. What are you wrapping it with, as an alternative to python-driver?

— Sent from Mailbox for iPhone

On Tue, Nov 26, 2013 at 1:19 PM, Jonathan Haddad j...@jonhaddad.com wrote:

So, for cqlengine (https://github.com/cqlengine/cqlengine), we're currently using the Thrift API to execute CQL until the native driver is out of beta. I'm a little biased in recommending it, since I'm one of the primary authors. If you've got cqlengine-specific questions, head to the mailing list: https://groups.google.com/forum/#!forum/cqlengine-users If you want to roll your own solution, it might make sense to take an approach like we did and throw a layer on top of Thrift, so you don't have to do a massive rewrite of your entire app once you want to go native.

Jon

On Tue, Nov 26, 2013 at 9:46 AM, Kumar Ranjan winnerd...@gmail.com wrote:

I have worked with Pycassa before and wrote a wrapper to use batch mutation, connection pooling, etc. But http://wiki.apache.org/cassandra/ClientOptions now recommends using a CQL 3 based API, because the Thrift-based API (Pycassa) will be supported for backward compatibility only. The Apache site recommends the Python API written by DataStax, which is still in beta (as per their documentation). See the warning from their python-driver/README.rst file:

"Warning: This driver is currently under heavy development, so the API and layout of packages, modules, classes, and functions are subject to change. There may also be serious bugs, so usage in a production environment is *not* recommended at this time."

The DataStax site http://www.datastax.com/download/clientdrivers recommends using DB-API 2.0 plus legacy APIs. Is there more? Has anyone compared the CQL 3 based APIs? Which stands out on top? Answers based on facts will help the community, so please refrain from opinions. Please help??

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: Choosing python client lib for Cassandra
cqlengine supports batch queries; see the docs here: http://cqlengine.readthedocs.org/en/latest/topics/queryset.html#batch-queries

On Tue, Nov 26, 2013 at 11:53 AM, Kumar Ranjan winnerd...@gmail.com wrote:

Jon - Any comment on batching?

— Sent from Mailbox for iPhone

On Tue, Nov 26, 2013 at 2:52 PM, Laing, Michael michael.la...@nytimes.com wrote:

That's not a problem we have faced yet.

On Tue, Nov 26, 2013 at 2:46 PM, Kumar Ranjan winnerd...@gmail.com wrote:

How do you insert huge amounts of data?

On Tue, Nov 26, 2013 at 2:31 PM, Laing, Michael michael.la...@nytimes.com wrote:

I think thread pooling is always in operation - and we haven't seen any problems in that regard going to the 6 local nodes each client connects to. We haven't tried batching yet.

On Tue, Nov 26, 2013 at 2:05 PM, Kumar Ranjan winnerd...@gmail.com wrote:

Michael - thanks. Have you tried batching and thread pooling in python-driver? For now, I would avoid the object mapper cqlengine, just because of my deadlines.

On Tue, Nov 26, 2013 at 1:52 PM, Laing, Michael michael.la...@nytimes.com wrote:

We use the python-driver and have contributed some to its development. I have been careful not to push too fast on features until we need them. For example, we have just started using prepared statements - working well, BTW. Next we will employ futures and start to exploit the async nature of the new interface to C*. We are very familiar with libev in both C and Python, and are happy to dig into the code to add features and fix bugs as needed, so the rewards of bypassing the old and focusing on the new seem worth the risks to us.

ml

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
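As a sketch of the batch support referenced above (the model name and columns are hypothetical, and this assumes a configured cqlengine connection, so it is illustrative only):

```python
# Hypothetical model; assumes cqlengine's connection setup has been done.
from cqlengine import BatchQuery

# Group several inserts into a single batch; the batch executes when
# the with-block exits.
with BatchQuery() as b:
    ExampleModel.batch(b).create(partition=1, cluster=1, value='a')
    ExampleModel.batch(b).create(partition=1, cluster=2, value='b')
```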
Re: Choosing python client lib for Cassandra
No, 2.7 only. On Tue, Nov 26, 2013 at 3:04 PM, Kumar Ranjan winnerd...@gmail.com wrote: Hi Jonathan - Does cqlengine have support for python 2.6 ? On Tue, Nov 26, 2013 at 4:17 PM, Jonathan Haddad j...@jonhaddad.comwrote: cqlengine supports batch queries, see the docs here: http://cqlengine.readthedocs.org/en/latest/topics/queryset.html#batch-queries On Tue, Nov 26, 2013 at 11:53 AM, Kumar Ranjan winnerd...@gmail.comwrote: Jon - Any comment on batching? — Sent from Mailbox https://www.dropbox.com/mailbox for iPhone On Tue, Nov 26, 2013 at 2:52 PM, Laing, Michael michael.la...@nytimes.com wrote: That's not a problem we have faced yet. On Tue, Nov 26, 2013 at 2:46 PM, Kumar Ranjan winnerd...@gmail.comwrote: How do you insert huge amount of data? — Sent from Mailbox https://www.dropbox.com/mailbox for iPhone On Tue, Nov 26, 2013 at 2:31 PM, Laing, Michael michael.la...@nytimes.com wrote: I think thread pooling is always in operation - and we haven't seen any problems in that regard going to the 6 local nodes each client connects to. We haven't tried batching yet. On Tue, Nov 26, 2013 at 2:05 PM, Kumar Ranjan winnerd...@gmail.comwrote: Michael - thanks. Have you tried batching and thread pooling in python-driver? For now, i would avoid object mapper cqlengine, just because of my deadlines. — Sent from Mailbox https://www.dropbox.com/mailbox for iPhone On Tue, Nov 26, 2013 at 1:52 PM, Laing, Michael michael.la...@nytimes.com wrote: We use the python-driver and have contributed some to its development. I have been careful to not push too fast on features until we need them. For example, we have just started using prepared statements - working well BTW. Next we will employ futures and start to exploit the async nature of new interface to C*. We are very familiar with libev in both C and python, and are happy to dig into the code to add features and fix bugs as needed, so the rewards of bypassing the old and focusing on the new seem worth the risks to us. 
ml On Tue, Nov 26, 2013 at 1:16 PM, Jonathan Haddad j...@jonhaddad.com wrote: So, for cqlengine (https://github.com/cqlengine/cqlengine), we're currently using the thrift api to execute CQL until the native driver is out of beta. I'm a little biased in recommending it, since I'm one of the primary authors. If you've got cqlengine specific questions, head to the mailing list: https://groups.google.com/forum/#!forum/cqlengine-users If you want to roll your own solution, it might make sense to take an approach like we did and throw a layer on top of thrift so you don't have to do a massive rewrite of your entire app once you want to go native. Jon On Tue, Nov 26, 2013 at 9:46 AM, Kumar Ranjan winnerd...@gmail.com wrote: I have worked with Pycassa before and wrote a wrapper to use batch mutation connection pooling etc. But http://wiki.apache.org/cassandra/ClientOptions recommends now to use CQL 3 based api because Thrift based api (Pycassa) will be supported for backward compatibility only. Apache site recommends to use Python api written by DataStax which is still in Beta (As per their documentation). See warnings from their python-driver/README.rst file *Warning* This driver is currently under heavy development, so the API and layout of packages,modules, classes, and functions are subject to change. There may also be serious bugs, so usage in a production environment is *not* recommended at this time. DataStax site http://www.datastax.com/download/clientdrivers recommends using DB-API 2.0 plus legacy api's. Is there more? Has any one compared between CQL 3 based apis? Which stands out on top? Answers based on facts will help the community so please refrain from opinions. Please help ?? -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: cassandra performance problems
Do you mean high CPU usage or high load avg? (20 indicates load avg to me). High load avg means the CPU is waiting on something. Run iostat -dmx 1 100 to check your disk stats; you'll see the columns that indicate MB/s read/write as well as % utilization. Once you understand the bottleneck we can start to narrow down the cause. On Thu, Dec 5, 2013 at 4:33 AM, Alexander Shutyaev shuty...@gmail.com wrote: Hi all, We have a 3 node cluster setup, single keyspace, about 500 tables. The hardware is 2 cores + 16 GB RAM (Cassandra chose to have 4GB). Cassandra version is 2.0.3. Our replication factor is 3, read/write consistency is QUORUM. We've plugged it into our production environment as a cache in front of postgres. Everything worked fine, we even stressed it by explicitly propagating about 30G (10G/node) of data from postgres to cassandra. Then the problems came. Our nodes began showing high cpu usage (around 20). The funny thing is that they were actually doing it one after another and there was always only one node with high cpu usage. Using OpsCenter we saw that when the CPU was beginning to go high the node in question was performing compaction. But even after the compaction was performed the cpu still remained high, and in some cases didn't go down for hours. Our jmx monitoring showed that it was presumably in constant garbage collection. During that time cluster read latency goes from 2ms to 200ms. What can be the reason? Can it be the high number of tables? Do we need to adjust some settings for this setup? Is it ok to have so many tables? Theoretically we could stick them all in 3-4 tables. Thanks in advance, Alexander -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
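The load-avg-vs-CPU distinction above is easy to check programmatically. A stdlib sketch (Linux/macOS only, since it relies on os.getloadavg; the threshold is a rough rule of thumb, not a hard rule) that flags when the 1-minute load exceeds the core count:

```python
# Load average counts runnable AND uninterruptible (I/O-waiting) tasks,
# so a load of 20 on a 2-core box usually means tasks queued waiting on
# something (often disk), not that the CPUs are 20x busy.
import os

one_min, five_min, fifteen_min = os.getloadavg()
cores = os.cpu_count() or 1

# 1-minute load well above the core count suggests a bottleneck;
# if CPU utilization is simultaneously low, suspect disk (check iostat).
saturated = one_min > cores
print("load per core: %.2f (saturated: %s)" % (one_min / cores, saturated))
```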
new project - Under Siege
I've recently pushed up a new project to github, which we've named Under Siege. It's a java agent for reporting Cassandra metrics to statsd. We're in the process of deploying it to our production clusters. Tested against Cassandra 1.2.11. The metrics library seems to change on every release of C*, so I'm not sure what'll happen if you deploy against a different version. You might need to mvn package against the same version of metrics. https://github.com/StartTheShift/UnderSiege I'm not much of a Java programmer, so there are probably about a hundred things I could have done better. Pull requests welcome. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
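For context on what "reporting metrics to statsd" involves: the statsd wire format is just "name:value|type" text over UDP. A minimal stdlib sketch of emitting metrics that way (the metric names, host, and port here are assumptions for illustration, not Under Siege's actual configuration):

```python
# Minimal statsd emitter: plain-text datagrams, fire-and-forget UDP.
import socket

def send_statsd(name, value, metric_type, host="127.0.0.1", port=8125):
    """Send one statsd line, e.g. 'cassandra.reads:1|c'. Returns the payload."""
    payload = "%s:%s|%s" % (name, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
    return payload

send_statsd("cassandra.read_latency_ms", 2.1, "g")  # gauge
send_statsd("cassandra.reads", 1, "c")              # counter
```

Because it's UDP, a down statsd daemon never blocks or breaks the sender, which is why the pattern is safe to run inside a database process.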
Re: cassandra backup
I believe SSTables are written to a temporary file then moved. If I remember correctly, tools like tablesnap listen for the inotify event IN_MOVED_TO. This should handle the "trying to back up an sstable mid-write" issue. On Fri, Dec 6, 2013 at 5:39 AM, Michael Theroux mthero...@yahoo.com wrote: Hi Marcelo, Cassandra provides an eventually consistent model for backups. You can do staggered backups of data, with the idea that if you restore a node, and then do a repair, your data will be once again consistent. Cassandra will not automatically copy the data to other nodes (other than via hinted handoff). You should manually run repair after restoring a node. You should take snapshots when doing a backup, as it keeps the data you are backing up relevant to a single point in time; otherwise compaction could add/delete files while you are mid-backup, or worse, I imagine, attempt to access an SSTable mid-write. Snapshots work by using links, and don't take additional storage to perform. In our process we create the snapshot, perform the backup, and then clear the snapshot. One thing to keep in mind in your S3 cost analysis is that, even though storage is cheap, reads/writes to S3 are not (especially writes). If you are using LeveledCompaction, or otherwise have a ton of SSTables, some people have encountered increased costs moving the data to S3. Ourselves, we maintain backup EBS volumes that we regularly snapshot/rsync data to. Thus far this has worked very well for us. -Mike On Friday, December 6, 2013 8:14 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hello everyone, I am trying to create backups of my data on AWS. My goal is to store the backups on S3 or glacier, as it's cheap to store this kind of data. So, if I have a cluster with N nodes, I would like to copy data from all N nodes to S3 and be able to restore later.
I know Priam does that (we were using it), but I am using the latest cassandra version and we plan to use DSE some time, so I am not sure Priam fits this case. I took a look at the docs: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/operations/../../cassandra/operations/ops_backup_takes_snapshot_t.html And I am trying to understand if it's really necessary to take a snapshot to create my backup. Suppose I do a flush and copy the sstables from each node to s3. Not all at the same time, but one by one. When I try to restore my backup, data from node 1 will be older than data from node 2. Will this cause problems? AFAIK, if I am using a replication factor of 2, for instance, and Cassandra sees data from node X only, it will automatically copy it to other nodes, right? Is there any chance of cassandra nodes becoming corrupt somehow if I do my backups this way? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
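Mike's point that "snapshots work by using links, and don't take additional storage" can be demonstrated with plain hard links. This is a toy illustration of the mechanism (file names are made up), not Cassandra's actual snapshot code: because sstables are immutable, a hard link into a snapshot directory costs no data copy, and the snapshot keeps reading the same bytes even after compaction deletes the live file name.

```python
# Hard-link "snapshot" of an immutable file: same inode, no extra storage,
# survives deletion of the original directory entry.
import os, tempfile

datadir = tempfile.mkdtemp()
sstable = os.path.join(datadir, "users-ka-1-Data.db")
snapdir = os.path.join(datadir, "snapshots", "backup1")
os.makedirs(snapdir)

with open(sstable, "w") as f:
    f.write("immutable sstable contents")

os.link(sstable, os.path.join(snapdir, "users-ka-1-Data.db"))  # no data copy
os.remove(sstable)  # "compaction" removes the live file...

# ...but the snapshot's link still reads the same bytes:
with open(os.path.join(snapdir, "users-ka-1-Data.db")) as f:
    print(f.read())
```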
Re: Cassandra ring not behaving like a ring
Please include the output of nodetool ring, otherwise no one can help you. On Thu, Jan 16, 2014 at 12:45 PM, Narendra Sharma narendra.sha...@gmail.com wrote: Any pointers? I am planning to do a rolling restart of the cluster nodes to see if it will help. On Jan 15, 2014 2:59 PM, Narendra Sharma narendra.sha...@gmail.com wrote: RF=3. On Jan 15, 2014 1:18 PM, Andrey Ilinykh ailin...@gmail.com wrote: what is the RF? What does nodetool ring show? On Wed, Jan 15, 2014 at 1:03 PM, Narendra Sharma narendra.sha...@gmail.com wrote: Sorry for the odd subject but something is wrong with our cassandra ring. We have a 9 node ring as below. N1 - UP/NORMAL N2 - UP/NORMAL N3 - UP/NORMAL N4 - UP/NORMAL N5 - UP/NORMAL N6 - UP/NORMAL N7 - UP/NORMAL N8 - UP/NORMAL N9 - UP/NORMAL Using random partitioner and simple snitch. Cassandra 1.1.6 in AWS. I added a new node with a token that is exactly in the middle of N6 and N7. So the ring displayed as follows: N1 - UP/NORMAL N2 - UP/NORMAL N3 - UP/NORMAL N4 - UP/NORMAL N5 - UP/NORMAL N6 - UP/NORMAL N6.5 - UP/JOINING N7 - UP/NORMAL N8 - UP/NORMAL N9 - UP/NORMAL I noticed that N6.5 is streaming from N1, N2, N6 and N7. I expect it to stream from (worst case) N5, N6, N7, N8. What could potentially cause the node to get confused about the ring? -- Narendra Sharma Software Engineer http://www.aeris.com http://narendrasharma.blogspot.com/ -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Cassandra ring not behaving like a ring
It depends on a lot of factors. If you've got all your machines in a single rack, probably not. But if you want to spread your data across multiple racks or availability zones in AWS, it makes a huge difference. On Thu, Jan 16, 2014 at 2:05 PM, Yogi Nerella ynerella...@gmail.com wrote: Hi, I am new to the Cassandra environment; does the order of the ring matter, as long as the member joins the group? Yogi On Thu, Jan 16, 2014 at 12:49 PM, Jonathan Haddad j...@jonhaddad.com wrote: Please include the output of nodetool ring, otherwise no one can help you. On Thu, Jan 16, 2014 at 12:45 PM, Narendra Sharma narendra.sha...@gmail.com wrote: Any pointers? I am planning to do a rolling restart of the cluster nodes to see if it will help. On Jan 15, 2014 2:59 PM, Narendra Sharma narendra.sha...@gmail.com wrote: RF=3. On Jan 15, 2014 1:18 PM, Andrey Ilinykh ailin...@gmail.com wrote: what is the RF? What does nodetool ring show? On Wed, Jan 15, 2014 at 1:03 PM, Narendra Sharma narendra.sha...@gmail.com wrote: Sorry for the odd subject but something is wrong with our cassandra ring. We have a 9 node ring as below. N1 - UP/NORMAL N2 - UP/NORMAL N3 - UP/NORMAL N4 - UP/NORMAL N5 - UP/NORMAL N6 - UP/NORMAL N7 - UP/NORMAL N8 - UP/NORMAL N9 - UP/NORMAL Using random partitioner and simple snitch. Cassandra 1.1.6 in AWS. I added a new node with a token that is exactly in the middle of N6 and N7. So the ring displayed as follows: N1 - UP/NORMAL N2 - UP/NORMAL N3 - UP/NORMAL N4 - UP/NORMAL N5 - UP/NORMAL N6 - UP/NORMAL N6.5 - UP/JOINING N7 - UP/NORMAL N8 - UP/NORMAL N9 - UP/NORMAL I noticed that N6.5 is streaming from N1, N2, N6 and N7. I expect it to stream from (worst case) N5, N6, N7, N8. What could potentially cause the node to get confused about the ring?
-- Narendra Sharma Software Engineer *http://www.aeris.com http://www.aeris.com* *http://narendrasharma.blogspot.com/ http://narendrasharma.blogspot.com/* -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Recommended OS
I would just advise against it because it's going to be difficult to narrow down what's causing problems. For instance, if you have Node A which is performing GC, it will affect query times on Node B which is trying to satisfy a quorum read. Node B might actually have very low load, and it will be difficult to understand why its queries are responding slowly. Meanwhile, Node A, during the GC pause, will have no disk activity, and most of the CPUs will not be fully utilized. I'm not saying it's impossible to do this, but I will say you'd better have a really great understanding of every single OS in your cluster. It's generally hard to find people who are experts in Linux, Windows, and BeOS. Of course, if you want to ride that train, you'd probably have a great blog post. My guess is it'll end with our recommendation: 'don't do this'. On Wed, Feb 12, 2014 at 2:36 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, Feb 12, 2014 at 1:25 PM, Ben Bromhead b...@instaclustr.com wrote: If you are super keen on running on something different from linux in production (after all the warnings), run most of your cluster on linux, then run a single node or a separate DC with SmartOS, Solaris, BeOS, OS/2, Minix, Windows 3.1 or whatever it is that you choose and let us know how it all goes! My understanding is that running a mixed OS cluster is not officially supported. I could be wrong, but don't think I am. :) =Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
abusing cassandra's multi DC abilities
Upfront TLDR: We want to do stuff (reindex documents, bust cache) when changed data from DC1 shows up in DC2. Full Story: We're planning on adding data centers throughout the US. Our platform is used for business communications. Each DC currently utilizes elastic search and redis. A message can be sent from one user to another, and the intent is that it would be seen in near-real-time. This means that 2 people may be using different data centers, and the messages need to propagate from one to the other. On the plus side, we know we get this with Cassandra (fist pump) but the other pieces, not so much. Even if they did work, there's all sorts of race conditions that could pop up from having different pieces of our architecture communicating over different channels. From this, we've arrived at the idea that since Cassandra is the authoritative data source, we might be able to trigger events in DC2 based on activity coming through either the commit log or some other means. One idea was to use a CF with a low gc time as a means of transporting messages between DCs, and watching the commit logs for deletes to that CF in order to know when we need to do things like reindex a document (or a new document), bust cache, etc. Facebook did something similar with their modifications to MySQL to include cache keys in the replication log. Assuming this is sane, I'd want to avoid having the same event register on 3 servers, thus registering 3 items in the queue when only one should be there. So, for any piece of data replicated from the other DC, I'd need a way to determine if it was supposed to actually trigger the event or not. (Maybe it looks at the token and determines if the current server falls in the token range?) Or is there a better way? So, my questions to all ye Cassandra users: 1. Is this is even sane? 2. Is anyone doing it? -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
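The "look at the token and decide whether this server fires the event" idea above could look something like the following sketch. The ring, tokens, and hash are all made up for illustration (Cassandra uses much larger token spaces), but the property we want holds: exactly one node in the ring claims any given key, so a replicated delete triggers one queue item instead of RF items.

```python
# Toy token ring: each node owns the range ending at its token; only the
# owner of a key's token fires the reindex/cache-bust event.
import bisect, hashlib

ring = sorted([0, 42, 85, 128, 170, 213])  # one token per node (toy values)

def token_for(key):
    # Stand-in for the partitioner's hash; 256-slot token space for the demo.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 256

def owner(token, ring):
    """First node whose token is >= the row token, wrapping around."""
    i = bisect.bisect_left(ring, token)
    return ring[i % len(ring)]

def should_fire(my_token, key):
    return owner(token_for(key), ring) == my_token

# Exactly one node claims each key:
key = "message-123"
claimants = [t for t in ring if should_fire(t, key)]
assert len(claimants) == 1
```

One caveat worth noting: with this scheme the event only fires if the owning node is up when the data arrives, so some fallback (e.g. next node in the ring) would be needed for real fault tolerance.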
Re: abusing cassandra's multi DC abilities
Thanks for the input Todd. I've considered a few of the options you've listed. I've ruled out redis because it's not really built for multi DC. I've got nothing against XMPP, or SQS. However, they introduce race conditions as well as all sorts of edge cases (missed messages, for instance). Since Cassandra is the source of truth, why not piggyback a useful message within the true source of data itself? On Mon, Feb 24, 2014 at 8:49 PM, Todd Fast t...@digitalexistence.comwrote: Hi Jonathan-- First, best wishes for success with your platform. Frankly, I think the architecture you described is only going to cause you major trouble. I'm left wondering why you don't either use something like XMPP (of which several implementations can handle this kind of federated scenario) or simply have internal (REST) APIs to send a message from the backend in one DC to the backend in another DC. There are a bunch of ways to approach this problem: You could also use Redis pubsub (though a bit brittle), SQS, or any number of other approaches that would be simpler and more robust than what you described. I'd urge you to really consider another approach. Best, Todd On Saturday, February 22, 2014, Jonathan Haddad j...@jonhaddad.com wrote: Upfront TLDR: We want to do stuff (reindex documents, bust cache) when changed data from DC1 shows up in DC2. Full Story: We're planning on adding data centers throughout the US. Our platform is used for business communications. Each DC currently utilizes elastic search and redis. A message can be sent from one user to another, and the intent is that it would be seen in near-real-time. This means that 2 people may be using different data centers, and the messages need to propagate from one to the other. On the plus side, we know we get this with Cassandra (fist pump) but the other pieces, not so much. 
Even if they did work, there's all sorts of race conditions that could pop up from having different pieces of our architecture communicating over different channels. From this, we've arrived at the idea that since Cassandra is the authoritative data source, we might be able to trigger events in DC2 based on activity coming through either the commit log or some other means. One idea was to use a CF with a low gc time as a means of transporting messages between DCs, and watching the commit logs for deletes to that CF in order to know when we need to do things like reindex a document (or a new document), bust cache, etc. Facebook did something similar with their modifications to MySQL to include cache keys in the replication log. Assuming this is sane, I'd want to avoid having the same event register on 3 servers, thus registering 3 items in the queue when only one should be there. So, for any piece of data replicated from the other DC, I'd need a way to determine if it was supposed to actually trigger the event or not. (Maybe it looks at the token and determines if the current server falls in the token range?) Or is there a better way? So, my questions to all ye Cassandra users: 1. Is this is even sane? 2. Is anyone doing it? -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Cassandra Snapshots giving me corrupted SSTables in the logs
I have a nagging memory of reading about issues with virtualization and not actually having durable versions of your data even after an fsync (within the VM). Googling around led me to this post: http://petercai.com/virtualization-is-bad-for-database-integrity/ It's possible you're hitting this issue, either with the virtualization layer or with EBS itself. Just a shot in the dark though; other people would likely know much more than I. On Fri, Mar 28, 2014 at 12:50 PM, Russ Lavoie ussray...@yahoo.com wrote: Robert, That is what I thought as well. But apparently something is happening. The only way I can get away with doing this is adding a sleep 60 right after the nodetool snapshot is executed. I can reproduce this 100% of the time by not issuing a sleep after nodetool snapshot. This is the error. ERROR [SSTableBatchOpen:1] 2014-03-28 17:08:14,290 CassandraDaemon.java (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main] org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:108) at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63) at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42) at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:407) at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:198) at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157) at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by:
java.io.EOFException at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340) at java.io.DataInputStream.readUTF(DataInputStream.java:589) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:83) ... 11 more On Friday, March 28, 2014 2:38 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Mar 28, 2014 at 12:21 PM, Russ Lavoie ussray...@yahoo.comwrote: Thank you for your quick response. Is there a way to tell when a snapshot is completely done? IIRC, the JMX call blocks until the snapshot completes. It should be done when nodetool returns. =Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
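The EOFException in this thread is what reading a half-finished file tends to look like. The standard way writers avoid exposing partial files (the temp-file-then-move behavior that backup tools key off) is to write under a temporary name and rename into place: a same-filesystem rename is atomic, so readers see either nothing or the complete file. A stdlib sketch of that pattern, with hypothetical file names:

```python
# Atomic publish: write to a temp name, fsync, then rename into place.
# Readers watching for the file (or a "moved into place" event) never
# observe a partial write.
import os, tempfile

def atomic_write(path, data):
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # durable before it becomes visible
    os.rename(tmp, path)       # atomic on the same filesystem

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "example-Data.db")
atomic_write(target, "complete sstable contents")
```

Note this protects against readers seeing partial files on the live filesystem; it doesn't address the separate EBS/VM durability question raised above.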
Re: Cassandra Snapshots giving me corrupted SSTables in the logs
I will +1 the recommendation on using tablesnap over EBS. S3 is at least predictable. Additionally, from a practical standpoint, you may want to back up your sstables somewhere. If you use S3, it's easy to pull just the new tables out via aws-cli tools (s3 sync), to your remote, non-aws server, and not incur the overhead of routinely backing up the entire dataset. For a non trivial database, this matters quite a bit. On Fri, Mar 28, 2014 at 1:21 PM, Laing, Michael michael.la...@nytimes.comwrote: As I tried to say, EBS snapshots require much care or you get corruption such as you have encountered. Does Cassandra quiesce the file system after a snapshot using fsfreeze or xfs_freeze? Somehow I doubt it... On Fri, Mar 28, 2014 at 4:17 PM, Jonathan Haddad j...@jonhaddad.comwrote: I have a nagging memory of reading about issues with virtualization and not actually having durable versions of your data even after an fsync (within the VM). Googling around lead me to this post: http://petercai.com/virtualization-is-bad-for-database-integrity/ It's possible you're hitting this issue, with with the virtualization layer, or with EBS itself. Just a shot in the dark though, other people would likely know much more than I. On Fri, Mar 28, 2014 at 12:50 PM, Russ Lavoie ussray...@yahoo.comwrote: Robert, That is what I thought as well. But apparently something is happening. The only way I can get away with doing this is adding a sleep 60 right after the nodetool snapshot is executed. I can reproduce this 100% of the time by not issuing a sleep after nodetool snapshot. This is the error. 
ERROR [SSTableBatchOpen:1] 2014-03-28 17:08:14,290 CassandraDaemon.java (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main] org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:108) at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63) at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42) at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:407) at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:198) at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157) at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.io.EOFException at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340) at java.io.DataInputStream.readUTF(DataInputStream.java:589) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:83) ... 11 more On Friday, March 28, 2014 2:38 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Mar 28, 2014 at 12:21 PM, Russ Lavoie ussray...@yahoo.comwrote: Thank you for your quick response. Is there a way to tell when a snapshot is completely done? IIRC, the JMX call blocks until the snapshot completes. It should be done when nodetool returns. =Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Cassandra Snapshots giving me corrupted SSTables in the logs
Another thing to keep in mind is that if you are hitting the issue I described, waiting 60 seconds will not absolutely solve your problem, it will only make it less likely to occur. If a memtable has been partially flushed at the 60 second mark you will end up with the same corrupt sstable. On Fri, Mar 28, 2014 at 1:32 PM, Laing, Michael michael.la...@nytimes.comwrote: +1 for tablesnap On Fri, Mar 28, 2014 at 4:28 PM, Jonathan Haddad j...@jonhaddad.comwrote: I will +1 the recommendation on using tablesnap over EBS. S3 is at least predictable. Additionally, from a practical standpoint, you may want to back up your sstables somewhere. If you use S3, it's easy to pull just the new tables out via aws-cli tools (s3 sync), to your remote, non-aws server, and not incur the overhead of routinely backing up the entire dataset. For a non trivial database, this matters quite a bit. On Fri, Mar 28, 2014 at 1:21 PM, Laing, Michael michael.la...@nytimes.com wrote: As I tried to say, EBS snapshots require much care or you get corruption such as you have encountered. Does Cassandra quiesce the file system after a snapshot using fsfreeze or xfs_freeze? Somehow I doubt it... On Fri, Mar 28, 2014 at 4:17 PM, Jonathan Haddad j...@jonhaddad.comwrote: I have a nagging memory of reading about issues with virtualization and not actually having durable versions of your data even after an fsync (within the VM). Googling around lead me to this post: http://petercai.com/virtualization-is-bad-for-database-integrity/ It's possible you're hitting this issue, with with the virtualization layer, or with EBS itself. Just a shot in the dark though, other people would likely know much more than I. On Fri, Mar 28, 2014 at 12:50 PM, Russ Lavoie ussray...@yahoo.comwrote: Robert, That is what I thought as well. But apparently something is happening. The only way I can get away with doing this is adding a sleep 60 right after the nodetool snapshot is executed. 
I can reproduce this 100% of the time by not issuing a sleep after nodetool snapshot. This is the error. ERROR [SSTableBatchOpen:1] 2014-03-28 17:08:14,290 CassandraDaemon.java (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main] org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:108) at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63) at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42) at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:407) at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:198) at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157) at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.io.EOFException at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340) at java.io.DataInputStream.readUTF(DataInputStream.java:589) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:83) ... 11 more On Friday, March 28, 2014 2:38 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Mar 28, 2014 at 12:21 PM, Russ Lavoie ussray...@yahoo.comwrote: Thank you for your quick response. Is there a way to tell when a snapshot is completely done? IIRC, the JMX call blocks until the snapshot completes. It should be done when nodetool returns. 
=Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Tune cache MB settings per table.
I think of all the areas you could spend your time, this will have the least returns. The OS will keep the most frequently used data in memory. There's no reason to require Cassandra to do it. If you're curious as to what's been loaded into ram, try Al Tobey's pcstat utility. https://github.com/tobert/pcstat On Sun, Jun 1, 2014 at 4:30 PM, Colin colpcl...@gmail.com wrote: Have you been unable to achieve your SLAs using Cassandra out of the box so far? Based upon my experience, by trying to tune Cassandra before the app is done and without simulating real-world load patterns, you might actually be doing yourself a disservice. -- Colin 320-221-9531 On Jun 1, 2014, at 6:08 PM, Kevin Burton bur...@spinn3r.com wrote: Not in our experience… We've been using fadvise DONTNEED to purge pages that aren't necessary any longer. Of course YMMV based on your usage. I tend to like to control everything explicitly instead of having magic. That's worked out very well for us in the past so it would be nice to still have this on cassandra. On Sun, Jun 1, 2014 at 12:53 PM, Colin co...@clark.ws wrote: The OS should handle this really well as long as you're on a v3 Linux kernel -- *Colin Clark* +1-320-221-9531 On Jun 1, 2014, at 2:49 PM, Kevin Burton bur...@spinn3r.com wrote: It's possible to set caching to: all, keys_only, rows_only, or none .. for a given table. But we have one table which is MASSIVE and we only need the most recent 4-8 hours in memory. Anything older than that can go to disk as the queries there are very rare. … but I don't think cassandra can do this (which is a shame). Another option is to partition our tables per hour… then tell the older tables to cache 'none'… I hate this option though. A smarter mechanism would be to have a compaction strategy that created an SSTable for every hour and then had custom caching settings for that table. The additional upside for this is that TTLs would just drop the older data in the compactor..
-- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Customized Compaction Strategy: Dev Questions
I'd suggest creating 1 table per day, and dropping the tables you don't need once you're done. On Wed, Jun 4, 2014 at 10:44 AM, Redmumba redmu...@gmail.com wrote: Sorry, yes, that is what I was looking to do--i.e., create a TopologicalCompactionStrategy or similar. On Wed, Jun 4, 2014 at 10:40 AM, Russell Bradberry rbradbe...@gmail.com wrote: Maybe I’m misunderstanding something, but what makes you think that running a major compaction every day will cause the data from January 1st to exist in only one SSTable and not have data from other days in the SSTable as well? Are you talking about making a new compaction strategy that creates SSTables by day? On June 4, 2014 at 1:36:10 PM, Redmumba (redmu...@gmail.com) wrote: Let's say I run a major compaction every day, so that the oldest sstable contains only the data for January 1st. Assuming all the nodes are in-sync and have had at least one repair run before the table is dropped (so that all information for that time period is the same), wouldn't it be safe to assume that the same data would be dropped on all nodes? There might be a period when the compaction is running where different nodes might have an inconsistent view of just that day's data (in that some would have it and others would not), but the cluster would still function and become eventually consistent, correct? Also, if the entirety of the sstable is being dropped, wouldn't the tombstones be removed with it? I wouldn't be concerned with individual rows and columns, and this is a write-only table, more or less--the only deletes that occur in the current system are to delete the old data. On Wed, Jun 4, 2014 at 10:24 AM, Russell Bradberry rbradbe...@gmail.com wrote: I’m not sure what you want to do is feasible. At a high level I can see you running into issues with RF etc.
The SSTables node to node are not identical, so if you drop a full SSTable on one node there is no corresponding SSTable on the adjacent nodes to drop. You would need to choose data to compact out, and ensure it is removed on all replicas as well. But if your problem is that you’re low on disk space then you probably won’t be able to write out a new SSTable with the older information compacted out. Also, there is more to an SSTable than just data; the SSTable could have tombstones and other relics that haven’t been cleaned up from nodes coming or going. On June 4, 2014 at 1:10:58 PM, Redmumba (redmu...@gmail.com) wrote: Thanks, Russell--yes, a similar concept, just applied to sstables. I'm assuming this would require changes to both major compactions, and probably GC (to remove the old tables), but since I'm not super-familiar with the C* internals, I wanted to make sure it was feasible with the current toolset before I actually dived in and started tinkering. Andrew On Wed, Jun 4, 2014 at 10:04 AM, Russell Bradberry rbradbe...@gmail.com wrote: hmm, I see. So something similar to Capped Collections in MongoDB. On June 4, 2014 at 1:03:46 PM, Redmumba (redmu...@gmail.com) wrote: Not quite; if I'm at say 90% disk usage, I'd like to drop the oldest sstable rather than simply run out of space. The problem with using TTLs is that I have to try and guess how much data is being put in--since this is auditing data, the usage can vary wildly depending on time of year, verbosity of auditing, etc.. I'd like to maximize the disk space--not optimize the cleanup process. Andrew On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry rbradbe...@gmail.com wrote: You mean this: https://issues.apache.org/jira/browse/CASSANDRA-5228 ? On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote: Good morning! I've asked (and seen other people ask) about the ability to drop old sstables, basically creating a FIFO-like clean-up process.
Since we're using Cassandra as an auditing system, this is particularly appealing to us because it means we can maximize the amount of auditing data we can keep while still allowing Cassandra to clear old data automatically. My idea is this: perform compaction based on the range of dates available in the sstable (or just metadata about when it was created). For example, a major compaction could create a combined sstable per day--so that, say, 60 days of data after a major compaction would contain 60 sstables. My question then is, will this be possible by simply implementing a separate AbstractCompactionStrategy? Does this sound feasible at all? Based on the implementation of Size and Leveled strategies, it looks like I would have the ability to control what and how things get compacted, but I wanted to verify before putting time into it. Thank you so much for your time! Andrew -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
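Jon's table-per-day suggestion can be sketched as a small rotation helper that names tables by date and emits DROP statements for tables past the retention window. A hypothetical sketch (the `audit_` naming scheme and retention policy are assumptions, not from the thread); the returned CQL would be run through whatever driver you use:

```python
from datetime import date, timedelta

def table_for(day):
    # one table per day, e.g. audit_20140604
    return "audit_%s" % day.strftime("%Y%m%d")

def drop_statements(existing_tables, today, keep_days):
    """Return DROP TABLE statements for day-tables older than the retention window.

    Date-stamped names sort lexicographically in date order, so a simple
    string comparison against the oldest table we want to keep suffices.
    """
    oldest_kept = table_for(today - timedelta(days=keep_days - 1))
    return ["DROP TABLE %s;" % t
            for t in sorted(existing_tables) if t < oldest_kept]
```

Dropping a whole table removes its SSTables on every replica at once, which sidesteps the per-SSTable consistency questions discussed above.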
Re: Bad Request: Type error: cannot assign result of function token (type bigint) to id (type int)
You should read through the token docs, it has examples and specifications: http://cassandra.apache.org/doc/cql3/CQL.html#tokenFun On Thu, Jun 5, 2014 at 10:22 PM, Kevin Burton bur...@spinn3r.com wrote: I'm building a new schema which I need to read externally by paging through the result set. My understanding from reading the documentation, and this list, is that I can do that but I need to use the token() function. Only it doesn't work. Here's a reduction: create table test_paging ( id int, primary key(id) ); insert into test_paging (id) values (1); insert into test_paging (id) values (2); insert into test_paging (id) values (3); insert into test_paging (id) values (4); insert into test_paging (id) values (5); select * from test_paging where id > token(0); … but it gives me: Bad Request: Type error: cannot assign result of function token (type bigint) to id (type int) … What's that about? I can't find any documentation for this and there aren't any concise examples. -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Bad Request: Type error: cannot assign result of function token (type bigint) to id (type int)
Sorry, the datastax docs are actually a bit better: http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html Jon On Thu, Jun 5, 2014 at 10:46 PM, Jonathan Haddad j...@jonhaddad.com wrote: You should read through the token docs, it has examples and specifications: http://cassandra.apache.org/doc/cql3/CQL.html#tokenFun On Thu, Jun 5, 2014 at 10:22 PM, Kevin Burton bur...@spinn3r.com wrote: I'm building a new schema which I need to read externally by paging through the result set. My understanding from reading the documentation, and this list, is that I can do that but I need to use the token() function. Only it doesn't work. Here's a reduction: create table test_paging ( id int, primary key(id) ); insert into test_paging (id) values (1); insert into test_paging (id) values (2); insert into test_paging (id) values (3); insert into test_paging (id) values (4); insert into test_paging (id) values (5); select * from test_paging where id > token(0); … but it gives me: Bad Request: Type error: cannot assign result of function token (type bigint) to id (type int) … What's that about? I can't find any documentation for this and there aren't any concise examples. -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
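For reference, the query fails because it compares the raw `id` column (int) against `token(0)` (bigint); the documented form applies token() to both sides, e.g. `SELECT * FROM test_paging WHERE token(id) > token(1) LIMIT 2;`. The paging loop that query enables can be simulated without a cluster. A sketch, with plain sorted integers standing in for partitioner tokens (assumption: real code would re-issue the CQL above with the last-seen key instead of filtering a list):

```python
def page_by_token(tokens_in_ring, page_size):
    """Walk rows in token order, page_size at a time, mimicking how
    'WHERE token(id) > token(last_seen) LIMIT page_size' pages a table."""
    ordered = sorted(tokens_in_ring)  # the partitioner defines this order
    last = None
    while True:
        if last is None:
            page = ordered[:page_size]          # first page: no lower bound
        else:
            page = [t for t in ordered if t > last][:page_size]
        if not page:
            return
        yield page
        last = page[-1]                          # resume strictly after this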
Re: VPC AWS
This may not help you with the migration, but it may with maintenance management. I just put up a blog post on managing VPC security groups with a tool I open sourced at my previous company. If you're going to have different VPCs (staging / prod), it might help with managing security groups. http://rustyrazorblade.com/2014/06/an-introduction-to-roadhouse/ Semi shameless plug... but relevant. On Thu, Jun 5, 2014 at 12:01 PM, Aiman Parvaiz ai...@shift.com wrote: Cool, thanks again for this. On Thu, Jun 5, 2014 at 11:51 AM, Michael Theroux mthero...@yahoo.com wrote: You can have a ring spread across EC2 and the public subnet of a VPC. That is how we did our migration. In our case, we simply replaced the existing EC2 node with a new instance in the public VPC, restored from a backup taken right before the switch. -Mike -- *From:* Aiman Parvaiz ai...@shift.com *To:* Michael Theroux mthero...@yahoo.com *Cc:* user@cassandra.apache.org user@cassandra.apache.org *Sent:* Thursday, June 5, 2014 2:39 PM *Subject:* Re: VPC AWS Thanks for this info Michael. As far as restoring node in public VPC is concerned I was thinking ( and I might be wrong here) if we can have a ring spread across EC2 and public subnet of a VPC, this way I can simply decommission nodes in Ec2 as I gradually introduce new nodes in public subnet of VPC and I will end up with a ring in public subnet and then migrate them from public to private in a similar way may be. If anyone has any experience/ suggestions with this please share, would really appreciate it. Aiman On Thu, Jun 5, 2014 at 10:37 AM, Michael Theroux mthero...@yahoo.com wrote: The implementation of moving from EC2 to a VPC was a bit of a juggling act. 
Our motivation was twofold: 1) We were running out of static IP addresses, and it was becoming increasingly difficult in EC2 to design around limiting the number of static IP addresses to the number of public IP addresses EC2 allowed 2) VPC affords us an additional level of security that was desirable. However, we needed to consider the following limitations: 1) By default, you have a limited number of available public IPs for both EC2 and VPC. 2) AWS security groups need to be configured to allow traffic for Cassandra to/from instances in EC2 and the VPC. You are correct at the high level that the migration goes from EC2 → public VPC (VPC with an Internet Gateway) → private VPC (VPC with a NAT). The first phase was moving instances to the public VPC, setting broadcast and seeds to the public IPs we had available. Basically: 1) Take down a node, taking a snapshot for a backup 2) Restore the node on the public VPC, assigning it to the correct security group, manually setting the seeds to other available nodes 3) Verify the cluster can communicate 4) Repeat Realize the NAT instance on the private subnet will also require a public IP. What got really interesting is that near the end of the process we ran out of available IPs, requiring us to switch the final node that was on EC2 directly to the private VPC (and taking down two nodes at once, which our setup allowed given we had 6 nodes with an RF of 3). What we did, and highly suggest for the switch, is to write down every step that has to happen on every node during the switch. In our case, many of the moved nodes required slightly different configurations for items like the seeds.
It's been a couple of years, so my memory on this may be a little fuzzy :) -Mike -- *From:* Aiman Parvaiz ai...@shift.com *To:* user@cassandra.apache.org; Michael Theroux mthero...@yahoo.com *Sent:* Thursday, June 5, 2014 12:55 PM *Subject:* Re: VPC AWS Michael, Thanks for the response, I am about to head into something very similar if not exactly the same. I envision things happening on the same lines as you mentioned. I would be grateful if you could please throw some more light on how you went about switching cassandra nodes from the public subnet to private without any downtime. I have not started on this project yet, still in my research phase. I plan to have an ec2+public VPC cluster and then decommission ec2 nodes to have everything in the public subnet, next would be to move it to the private subnet. Thanks On Thu, Jun 5, 2014 at 8:14 AM, Michael Theroux mthero...@yahoo.com wrote: We personally use the EC2Snitch, however, we don't have the multi-region requirements you do. -Mike -- *From:* Alain RODRIGUEZ arodr...@gmail.com *To:* user@cassandra.apache.org *Sent:* Thursday, June 5, 2014 9:14 AM *Subject:* Re: VPC AWS I think you can define a VPC subnet to be public (to have public + private IPs) or private only. Any insight regarding snitches? What snitch do you guys use? 2014-06-05 15:06 GMT+02:00 William Oberman ober...@civicscience.com: I don't think traffic will flow between classic ec2 and vpc
Re: Best way to do a multi_get using CQL
Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among them is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recommend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same partition, as above. Is Cassandra the wrong tool for lookups like that?
Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Best way to do a multi_get using CQL
If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)" for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error("Query '%s' returned ERROR" % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among them is SELECT ... IN or index lookups¶ SELECT ...
IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recommend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same partition, as above. Is Cassandra the wrong tool for lookups like that? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
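The async fan-out Jon describes can be illustrated without a cluster by swapping the driver's `execute_async` for a thread pool; the shape of the code (submit one query per key, then gather the futures) is the same as in Marcelo's snippet. A sketch with hypothetical names: `lookup` and `FAKE_TABLE` are stand-ins for `session.execute_async` against the entity_lookup table, not real driver API.

```python
from concurrent.futures import ThreadPoolExecutor

# hypothetical in-memory stand-in for the entity_lookup table
FAKE_TABLE = {
    ("email", "a@example.com"): [1],
    ("email", "b@example.com"): [2, 3],
}

def lookup(name, value):
    # stand-in for session.execute_async(SELECT_ENTITY_LOOKUP, [name, value])
    return FAKE_TABLE.get((name, value), [])

def multi_get(pairs):
    """Fire one lookup per (name, value) pair, then gather the results."""
    entity_ids = set()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # fan out: one future per key, like one execute_async per key
        futures = [pool.submit(lookup, n, v) for n, v in pairs]
        # gather: same pattern as future.result(timeout=10) in the thread
        for f in futures:
            entity_ids.update(f.result(timeout=10))
    return entity_ids
```

With a token-aware driver each per-key query goes straight to a replica, which is the advantage over one IN query funneled through a single coordinator.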
Re: Best way to do a multi_get using CQL
The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recommended solution for these cases? Any other options? I still wonder if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async does the driver reuse the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data. On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it.
Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)" for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error("Query '%s' returned ERROR" % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among them is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried.
For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 }; USE Identification1; CREATE TABLE IF NOT EXISTS entity_lookup ( name varchar, value varchar, entity_id uuid, PRIMARY KEY ((name, value), entity_id)); And I use the following select to query it: SELECT entity_id FROM entity_lookup WHERE name=%s and value in(%s) Is this an anti-pattern? If not using SELECT IN, which other way would you recommend for lookups like that? I have several values I would like to search in cassandra and they might not be in the same partition, as above. Is Cassandra the wrong tool for lookups like that? Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype
Re: Best way to do a multi_get using CQL
, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Yes, I am using the CQL datastax drivers. It was a good advice, thanks a lot Jonathan. []s 2014-06-20 0:28 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: The only case in which it might be better to use an IN clause is if the entire query can be satisfied from that machine. Otherwise, go async. The native driver reuses connections and intelligently manages the pool for you. It can also multiplex queries over a single connection. I am assuming you're using one of the datastax drivers for CQL, btw. Jon On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: This is interesting, I didn't know that! It might make sense then to use select = + async + token aware, I will try to change my code. But would it be a recommended solution for these cases? Any other options? I still wonder if this is the right use case for Cassandra, to look for random keys in a huge cluster. After all, the amount of connections to Cassandra will still be huge, right... Wouldn't it be a problem? Or when you use async does the driver reuse the connection? []s 2014-06-19 22:16 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: If you use async and your driver is token aware, it will go to the proper node, rather than requiring the coordinator to do so. Realistically you're going to have a connection open to every server anyways. It's the difference between you querying for the data directly and using a coordinator as a proxy. It's faster to just ask the node with the data.
On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: But using async queries wouldn't be even worse than using SELECT IN? The justification in the docs is I could query many nodes, but I would still do it. Today, I use both async queries AND SELECT IN: SELECT_ENTITY_LOOKUP = "SELECT entity_id FROM " + ENTITY_LOOKUP + " WHERE name=%s and value in(%s)" for name, values in identifiers.items(): query = self.SELECT_ENTITY_LOOKUP % ('%s', ','.join(['%s']*len(values))) args = [name] + values query_msg = query % tuple(args) futures.append((query_msg, self.session.execute_async(query, args))) for query_msg, future in futures: try: rows = future.result(timeout=10) for row in rows: entity_ids.add(row.entity_id) except: logging.error("Query '%s' returned ERROR" % (query_msg)) raise Using async just with select = would mean instead of 1 async query (example: in (0, 1, 2)), I would do several, one for each value of values array above. In my head, this would mean more connections to Cassandra and the same amount of work, right? What would be the advantage? []s 2014-06-19 22:01 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Your other option is to fire off async queries. It's pretty straightforward w/ the java or python drivers. On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: I was taking a look at Cassandra anti-patterns list: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningAntiPatterns_c.html Among them is SELECT ... IN or index lookups¶ SELECT ... IN and index lookups (formerly secondary indexes) should be avoided except for specific scenarios. See When not to use IN in SELECT and When not to use an index in Indexing in CQL for Cassandra 2.0 And Looking at the SELECT doc, I saw: When not to use IN¶ The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended.
Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster having 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range. In my system, I have a column family called entity_lookup: CREATE KEYSPACE IF NOT EXISTS Identification1 WITH REPLICATION = { 'class
Re: Adding large text blob causes read timeout...
Can you do your query in the CLI after setting tracing on? On Mon, Jun 23, 2014 at 11:32 PM, DuyHai Doan doanduy...@gmail.com wrote: Yes but adding the extra one ends up by * 1000. The limit in CQL3 specifies the number of logical rows, not the number of physical columns in the storage engine On June 24, 2014 at 08:30, Kevin Burton bur...@spinn3r.com wrote: oh.. the difference between the ONE field and the remaining 29 is massive. It's like 200ms for just the 29 columns.. adding the extra one causes it to time out .. 5000ms... On Mon, Jun 23, 2014 at 10:30 PM, DuyHai Doan doanduy...@gmail.com wrote: Don't forget that when you do the Select with limit set to 1000, Cassandra is actually fetching 1000 * 29 physical columns (29 fields per logical row). Adding one extra big html column may be too much and cause a timeout. Try to: 1. Select only the big html only 2. Or reduce the limit incrementally until no timeout On June 24, 2014 at 06:22, Kevin Burton bur...@spinn3r.com wrote: I have a table with a schema mostly of small fields. About 30 of them. The primary key is: primary key( bucket, sequence ) … I have 100 buckets and the idea is that sequence is ever increasing. This way I can read from bucket zero, and everything after sequence N and get all the writes ordered by time. I'm running SELECT ... FROM content WHERE bucket=0 AND sequence > 0 ORDER BY sequence ASC LIMIT 1000; … using the Java driver. If I add ALL the fields, except one, so 29 fields, the query is fast. Only 129ms…. However, if I add the 'html' field, which is a snapshot of HTML obviously, the query times out… I'm going to add tracing and try to track it down further, but I suspect I'm doing something stupid. Is it going to burn me that the data is UTF8 encoded? I can't imagine decoding UTF8 is going to be THAT slow but perhaps cassandra is doing something silly under the covers?
cqlsh doesn't time out … it actually works fine, but it uses 100% CPU while writing out the data, so it's not a good comparison unfortunately. Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: ...:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read)) at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65) at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:256) at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:172) at com.datastax.driver.core.SessionManager.execute(SessionManager.java:92) at com.spinn3r.artemis.robot.console.BenchmarkContentStream.main(BenchmarkContentStream.java:100) Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: dev4.wdc.sl.spinn3r.com/10.24.23.94:9042 (com.datastax.driver.core.exceptions.DriverException: Timeout during read)) at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:103) at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:175) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace. Freedom is slavery. Ignorance is strength. Corporations are people. -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com War is peace.
Freedom is slavery. Ignorance is strength. Corporations are people. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
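DuyHai's second suggestion ("reduce the limit incrementally until no timeout") can be sketched as a simple backoff loop. `run_query` below is a stub standing in for a real driver call with a smaller LIMIT or fetch size; the 250-row cutoff is invented for illustration so the loop is runnable without a cluster.

```python
class ReadTimeout(Exception):
    pass

def run_query(limit):
    # Stub: pretend the node can serve at most 250 wide rows containing
    # the big html column before hitting the read timeout.
    if limit > 250:
        raise ReadTimeout(f"timeout at LIMIT {limit}")
    return [f"row-{i}" for i in range(limit)]

def fetch_with_backoff(limit=1000, floor=10):
    # Halve the page size on each timeout until the query succeeds.
    while limit >= floor:
        try:
            return limit, run_query(limit)
        except ReadTimeout:
            limit //= 2
    raise RuntimeError("even the smallest page timed out")

limit, rows = fetch_with_backoff()
```

With a real driver, the equivalent knob is a smaller fetch size on the statement, so the server materializes fewer physical columns per round trip.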
Re: Triggers and their use in data indexing
Triggers only execute on the local coordinator. I would also not recommend using them. On Thu, Jul 3, 2014 at 9:58 AM, Robert Coli rc...@eventbrite.com wrote: On Thu, Jul 3, 2014 at 4:41 AM, Bèrto ëd Sèra berto.d.s...@gmail.com wrote: Now the question: is there any way to use triggers so that they will locally index data from remote DCs when it comes in? As I understand it, you probably should not use triggers in production in their current form. =Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Triggers and their use in data indexing
This is one of the trickier areas of doing multi dc. The current recommendation is to use a separate message queue. If you'd like to see remote triggers, you could file a JIRA. Get back to the list w/ the ticket #, I'm sure there are others who have similar needs. On Thu, Jul 3, 2014 at 10:04 AM, Jonathan Haddad j...@jonhaddad.com wrote: Triggers only execute on the local coordinator. I would also not recommend using them. On Thu, Jul 3, 2014 at 9:58 AM, Robert Coli rc...@eventbrite.com wrote: On Thu, Jul 3, 2014 at 4:41 AM, Bèrto ëd Sèra berto.d.s...@gmail.com wrote: Now the question: is there any way to use triggers so that they will locally index data from remote DCs when it comes in? As I understand it, you probably should not use triggers in production in their current form. =Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Write Inconsistency to update a row
Did you make sure all the nodes are on the same time? If they're not, you'll get some weird results. On Thu, Jul 3, 2014 at 10:30 AM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: Are you sure all the nodes are working at that time? Yes. They are working. I would suggest increasing the replication factor (for example 3) and use CL=ALL or QUORUM to find out what is going wrong. I did! I still have the same problem. 2014-07-03 13:40 GMT-03:00 Panagiotis Garefalakis panga...@gmail.com: This seems like a hinted handoff issue but since you use CL = ONE it should happen. Are you sure all the nodes are working at that time? You could use nodetool status to check that. I would suggest increasing the replication factor (for example 3) and use CL=ALL or QUORUM to find out what is going wrong. Regards, Panagiotis On Thu, Jul 3, 2014 at 5:11 PM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: replication_factor=1 CL=ONE Does the data show up eventually? Yes. Could it be the clocks? 2014-07-03 10:47 GMT-03:00 graham sanderson gra...@vast.com: What is your keyspace replication_factor? What consistency level are you reading/writing with? Does the data show up eventually? I’m assuming you don’t have any errors (timeouts etc) on the write side On Jul 3, 2014, at 7:55 AM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: I have two Cassandra 2.0.5 servers running with some data inserted, where each row has one empty column. When the client sends a lot of update commands to fill this column in each row, some lines update their content, but some lines remain with the empty column. Using one server, this never happens! Any suggestions? Tks. -- Atenciosamente, Sávio S.
Teles de Oliveira voice: +55 62 9136 6996 http://br.linkedin.com/in/savioteles Mestrando em Ciências da Computação - UFG Arquiteto de Software CUIA Internet Brasil -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: Write Inconsistency to update a row
Make sure you've got ntpd running, otherwise this will be an ongoing nightmare. On Thu, Jul 3, 2014 at 5:00 PM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: I have synchronized the clocks and it works! 2014-07-03 20:58 GMT-03:00 Sávio S. Teles de Oliveira savio.te...@cuia.com.br: Did you make sure all the nodes are on the same time? If they're not, you'll get some weird results. They were not on the same time. I've synchronized the time and it works! Tks 2014-07-03 16:58 GMT-03:00 Jack Krupansky j...@basetechnology.com: You said that the updates do show up eventually – how long does it take? -- Jack Krupansky From: Sávio S. Teles de Oliveira Sent: Thursday, July 3, 2014 1:30 PM To: user@cassandra.apache.org Subject: Re: Write Inconsistency to update a row Are you sure all the nodes are working at that time? Yes. They are working. I would suggest increasing the replication factor (for example 3) and use CL=ALL or QUORUM to find out what is going wrong. I did! I still have the same problem. 2014-07-03 13:40 GMT-03:00 Panagiotis Garefalakis panga...@gmail.com: This seems like a hinted handoff issue but since you use CL = ONE it should happen. Are you sure all the nodes are working at that time? You could use nodetool status to check that. I would suggest increasing the replication factor (for example 3) and use CL=ALL or QUORUM to find out what is going wrong. Regards, Panagiotis On Thu, Jul 3, 2014 at 5:11 PM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: replication_factor=1 CL=ONE Does the data show up eventually? Yes. Could it be the clocks? 2014-07-03 10:47 GMT-03:00 graham sanderson gra...@vast.com: What is your keyspace replication_factor? What consistency level are you reading/writing with? Does the data show up eventually? I’m assuming you don’t have any errors (timeouts etc) on the write side On Jul 3, 2014, at 7:55 AM, Sávio S.
Teles de Oliveira savio.te...@cuia.com.br wrote: I have two Cassandra 2.0.5 servers running with some data inserted, where each row has one empty column. When the client sends a lot of update commands to fill this column in each row, some lines update their content, but some lines remain with the empty column. Using one server, this never happens! Any suggestions? Tks. -- Atenciosamente, Sávio S. Teles de Oliveira voice: +55 62 9136 6996 http://br.linkedin.com/in/savioteles Mestrando em Ciências da Computação - UFG Arquiteto de Software CUIA Internet Brasil -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
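To see why clock skew produces these "lost" updates: Cassandra reconciles conflicting cells by write timestamp, last write wins. If the coordinator taking the UPDATE has a clock running behind the one that took the original INSERT, the update carries an older timestamp and silently loses. A minimal sketch (the timestamps are invented for illustration):

```python
def resolve(cells):
    # cells: list of (timestamp_micros, value); the highest timestamp wins,
    # mirroring Cassandra's last-write-wins conflict resolution.
    return max(cells, key=lambda c: c[0])[1]

insert_ts = 1_404_000_000_000_000        # INSERT, stamped by coordinator A's clock
update_ts = insert_ts - 5_000_000        # UPDATE issued later in real time, but
                                         # coordinator B's clock is 5 seconds behind
winner = resolve([(insert_ts, "old"), (update_ts, "new")])  # the UPDATE loses
```

This is exactly the failure mode ntpd prevents: once clocks agree, the later write also carries the later timestamp.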
Re: Cassandra use cases/Strengths/Weakness
I've used various databases in production for over 10 years. Each has strengths and weaknesses. I ran Cassandra for just shy of 2 years in production as part of both development teams and operations, and I only hit 1 serious problem that Rob mentioned. Ideally C* would have guarded against it, but it did not. I did not have any downtime as a result, however. For those curious, I tried to add 1.2 nodes to a 1.1 cluster. Aside from that, I actually did find Cassandra simple to operate and manage. I used Cassandra as more of a general purpose database. I was willing to give up some query flexibility in favor of high availability and multi dc support. There were times we needed to add more servers to deal with additional load, and it handled it perfectly. For me it wasn't such a big problem; there are always optimizations that need to be made no matter what DB you use. Disclaimer: I now work for Datastax. On Tue, Jul 8, 2014 at 5:51 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Jul 4, 2014 at 2:10 PM, DuyHai Doan doanduy...@gmail.com wrote: c. operational simplicity due to master-less architecture. This feature, although quite transparent for developers, is a key selling point. Having suffered when manually installing a Hadoop cluster, I happen to love the deployment simplicity of C*, only one process per node, no moving parts. Asserting that Cassandra, as a fully functioning production system, is currently easier to operate than RDBMS is just false. It is still false even if we ignore the availability of experienced RDBMS operators and decades of RDBMS operational best practice. The quality of software engineering practice in RDBMS land also most assuredly results in a more easily operable system in many, many use cases. Yes, Cassandra is more tolerant to individual node failures. This turns out to not matter as much in terms of operability as non-operators appear to think it does.
Very trivial operational activities (create a new columnfamily or replace a failed node) are subject to failure mode edge cases which often are not resolvable without brute force methods. I am unable to get my head around the oft-heard marketing assertion that a data-store in which such common activities are not bulletproof is capable of being better to operate than the RDBMS status quo. The production operators I know also do not agree that Cassandra is simple to operate. All the above aside, I continue to maintain that Cassandra is the best at being the type of thing that it is. If you have a need to horizontally scale a use case that is well suited for its strengths and poorly suited for RDBMS, you should use it. Far fewer people actually have this sort of case than think they do. =Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: horizontal query scaling issues follow on
The problem with starting without vnodes is that moving to them is a bit hairy. In particular, nodetool shuffle has been reported to take an extremely long time (days, weeks). I would start with vnodes if you have any intent on using them. On Thu, Jul 17, 2014 at 6:03 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Jul 17, 2014 at 5:16 PM, Diane Griffith dfgriff...@gmail.com wrote: I did tests comparing 1, 2, 10, 20, 50, 100 clients spawned all querying. Performance on 2 nodes starts to degrade from 10 clients on. I saw similar behavior on 4 nodes but haven't done the official runs on that yet. Ok, if you've multi-threaded your client, then you aren't starving for client thread parallelism, and that rules out another scalability bottleneck. As a brief aside, you only lose from vnodes until your cluster is larger than a certain size, and then only when adding or removing nodes from a cluster. Perhaps if you are ramping up and scientifically testing smaller cluster sizes, you should start at first with a token per range, i.e. pre-vnodes operation? I basically did the command and it was outputting 256 tokens on each node, comma separated. So I tried taking that string and setting it as the value of initial_token, but the node wouldn't start up. Not sure if I maybe had a carriage return in there and that was the problem. It should take a comma delimited list of tokens, did the failed node startup log any error? And if I do that do I need to do more than comment out num_tokens? No, though you probably should anyway in order to be unambiguous. =Rob -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
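For the pre-vnodes setup Rob describes (one token per node), the initial tokens are usually generated by dividing the partitioner's token range evenly. A sketch assuming Murmur3Partitioner, whose range is -2^63 to 2^63 - 1; each node's value would go in its cassandra.yaml `initial_token` (one token, no commas, when num_tokens is not used):

```python
def initial_tokens(node_count):
    # Evenly space node_count tokens across the Murmur3 token range.
    ring = 2 ** 64
    return [-(2 ** 63) + i * ring // node_count for i in range(node_count)]

tokens = initial_tokens(4)
```

The comma-delimited form Rob mentions only applies when assigning multiple tokens to a single node; stray whitespace or a carriage return in that list is a plausible cause of the startup failure Diane saw.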
Re: map reduce for Cassandra
Hey Marcelo, You should check out spark. It intelligently deals with a lot of the issues you're mentioning. Al Tobey did a walkthrough of how to set up the OSS side of things here: http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html It'll be less work than writing a M/R framework from scratch :) Jon On Mon, Jul 21, 2014 at 8:24 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi, I need to execute a map/reduce job to identify data stored in Cassandra before indexing this data to Elastic Search. I have already used ColumnFamilyInputFormat (before I started using CQL) to write hadoop jobs to do that, but I used to have a lot of trouble performing tuning, as hadoop depends on how map tasks are split in order to successfully execute things in parallel, for IO-bound processes. First question is: Am I the only one having problems with that? Is anyone else using hadoop jobs that read from Cassandra in production? Second question is about the alternatives. I saw the new version of Spark will have Cassandra support, but using CqlPagingInputFormat, from hadoop. I tried to use HIVE with Cassandra community, but it seems it only works with Cassandra Enterprise and doesn't do more than FB presto (http://prestodb.io/), which we have been using to read from Cassandra, and so far it has been great for SQL-like queries. For custom map reduce jobs, however, it is not enough. Does anyone know some other tool that performs MR on Cassandra? My impression is most tools were created to work on top of HDFS, and reading from a nosql db is some kind of workaround. Third question is about how these tools work. Most of them write mapped data to intermediate storage, then data is shuffled and sorted, then it is reduced. Even when using CqlPagingInputFormat, if you are using hadoop it will write files to HDFS after the mapping phase, shuffle and sort this data, and then reduce it. I wonder if a tool supporting Cassandra out of the box wouldn't be smarter.
Is it faster to write all your data to a file and then sort it, or to batch insert data and index it right away, as happens when you store data in a Cassandra CF? I didn't do the calculations to check the complexity of each one, which should consider that an index in Cassandra could be really large, as the maximum index size will always depend on the maximum capacity of a single host, but my guess is that a map/reduce tool written specifically for Cassandra, from the beginning, could perform much better than a tool written for HDFS and adapted. I hear people saying Map/Reduce on Cassandra/HBase is usually 30% slower than M/R in HDFS. Does it really make sense? Should we expect a result like this? Final question: Do you think writing a new M/R tool as described would be reinventing the wheel? Or does it make sense? Thanks in advance. Any opinions about this subject will be very appreciated. Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
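The map/shuffle-sort/reduce pipeline Marcelo describes can be illustrated with a toy in-memory word count. In real Hadoop the intermediate (key, value) pairs are spilled to disk and shuffled across the network between phases, which is exactly the overhead he is asking about:

```python
from collections import defaultdict

def map_phase(splits):
    # Map: each input split emits (key, value) pairs.
    for split in splits:
        for word in split.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, then sort by key as Hadoop does
    # before handing groups to reducers.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: fold each key's value list down to a single result.
    return {k: sum(vs) for k, vs in grouped}

counts = reduce_phase(shuffle_phase(map_phase(["a b a", "b c"])))
```

Spark keeps these intermediate stages in memory where possible, which is a large part of why Jon's suggestion sidesteps the HDFS-spill cost.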
Re: map reduce for Cassandra
I haven't tried pyspark yet, but it's part of the distribution. My main language is Python too, so I intend on getting deep into it. On Mon, Jul 21, 2014 at 9:38 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi Jonathan, Do you know if this RDD can be used with Python? AFAIK, python + Cassandra will be supported just in the next version, but I would like to be wrong... Best regards, Marcelo Valle. 2014-07-21 13:06 GMT-03:00 Jonathan Haddad j...@jonhaddad.com: Hey Marcelo, You should check out spark. It intelligently deals with a lot of the issues you're mentioning. Al Tobey did a walkthrough of how to set up the OSS side of things here: http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html It'll be less work than writing a M/R framework from scratch :) Jon On Mon, Jul 21, 2014 at 8:24 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Hi, I have the need to executing a map/reduce job to identity data stored in Cassandra before indexing this data to Elastic Search. I have already used ColumnFamilyInputFormat (before start using CQL) to write hadoop jobs to do that, but I use to have a lot of troubles to perform tunning, as hadoop depends on how map tasks are split in order to successfull execute things in parallel, for IO/bound processes. First question is: Am I the only one having problems with that? Is anyone else using hadoop jobs that reads from Cassandra in production? Second question is about the alternatives. I saw new version spark will have Cassandra support, but using CqlPagingInputFormat, from hadoop. I tried to use HIVE with Cassandra community, but it seems it only works with Cassandra Enterprise and doesn't do more than FB presto (http://prestodb.io/), which we have been using reading from Cassandra and so far it has been great for SQL-like queries. For custom map reduce jobs, however, it is not enough. Does anyone know some other tool that performs MR on Cassandra? 
My impression is most tools were created to work on top of HDFS and reading from a nosql db is some kind of workaround. Third question is about how these tools work. Most of them writtes mapped data on a intermediate storage, then data is shuffled and sorted, then it is reduced. Even when using CqlPagingInputFormat, if you are using hadoop it will write files to HDFS after the mapping phase, shuffle and sort this data, and then reduce it. I wonder if a tool supporting Cassandra out of the box wouldn't be smarter. Is it faster to write all your data to a file and then sorting it, or batch inserting data and already indexing it, as it happens when you store data in a Cassandra CF? I didn't do the calculations to check the complexity of each one, what should consider no index in Cassandra would be really large, as the maximum index size will always depend on the maximum capacity of a single host, but my guess is that a map / reduce tool written specifically to Cassandra, from the beggining, could perform much better than a tool written to HDFS and adapted. I hear people saying Map/Reduce on Cassandra/HBase is usually 30% slower than M/R in HDFS. Does it really make sense? Should we expect a result like this? Final question: Do you think writting a new M/R tool like described would be reinventing the wheel? Or it makes sense? Thanks in advance. Any opinions about this subject will be very appreciated. Best regards, Marcelo Valle. -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: cluster rebalancing…
You don't need to specify tokens. The new node gets them automatically. On Jul 22, 2014, at 7:03 PM, Kevin Burton bur...@spinn3r.com wrote: So , shouldn't it be easy to rebalance a cluster? I'm not super excited to type out 200 commands to move around individual tokens. I realize that this isn't a super easy solution, and that there are probably 2-3 different algorithms to pick here… but having this be the only option doesn't seem scalable. -- Founder/CEO Spinn3r.com Location: San Francisco, CA blog: http://burtonator.wordpress.com … or check out my Google+ profile
Re: vnode and NetworkTopologyStrategy: not playing well together ?
This is incorrect. Network Topology w/ Vnodes will be fine, assuming you've got RF = # of racks. For each token, replicas are chosen based on the strategy. Essentially, you could have a wild imbalance in token ownership, but it wouldn't matter because the replicas would be distributed across the rest of the machines. http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html On Tue, Aug 5, 2014 at 8:19 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Hi, My understanding is that NetworkTopologyStrategy does NOT play well with vnodes, due to: · Vnode = tokens are (usually) randomly generated (AFAIK) · NetworkTopologyStrategy = requires carefully chosen tokens for all nodes in order not to get a VERY unbalanced ring like in https://issues.apache.org/jira/browse/CASSANDRA-3810 When playing with vnodes, is the recommendation to define one rack for the entire cluster ? Thanks. Regards, Dominique -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: vnode and NetworkTopologyStrategy: not playing well together ?
* When I say wild imbalance, I do not mean all tokens on 1 node in the cluster, I really should have said slightly imbalanced On Tue, Aug 5, 2014 at 8:43 AM, Jonathan Haddad j...@jonhaddad.com wrote: This is incorrect. Network Topology w/ Vnodes will be fine, assuming you've got RF= # of racks. For each token, replicas are chosen based on the strategy. Essentially, you could have a wild imbalance in token ownership, but it wouldn't matter because the replicas would be distributed across the rest of the machines. http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html On Tue, Aug 5, 2014 at 8:19 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Hi, My understanding is that NetworkTopologyStrategy does NOT play well with vnodes, due to: · Vnode = tokens are (usually) randomly generated (AFAIK) · NetworkTopologyStrategy = required carefully choosen tokens for all nodes in order to not to get a VERY unbalanced ring like in https://issues.apache.org/jira/browse/CASSANDRA-3810 When playing with vnodes, is the recommendation to define one rack for the entire cluster ? Thanks. Regards, Dominique -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re: vnode and NetworkTopologyStrategy: not playing well together ?
Yes, if you have only 1 machine in a rack then your cluster will be imbalanced. You're going to be able to dream up all sorts of weird failure cases when you choose a scenario like RF=2 with a totally imbalanced network arch. Vnodes attempt to solve the problem of imbalanced rings by choosing so many tokens that it's improbable that the ring will be imbalanced. On Tue, Aug 5, 2014 at 8:57 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: First, thanks for your answer. This is incorrect. Network Topology w/ Vnodes will be fine, assuming you've got RF= # of racks. IMHO, it's not a good enough condition. Let's use an example with RF=2 N1/rack_1 N2/rack_1 N3/rack_1 N4/rack_2 Here, you have RF = # of racks And due to NetworkTopologyStrategy, N4 will store *all* the cluster data, leading to a completely imbalanced cluster. IMHO, it happens when using nodes *or* vnodes. As well-balanced clusters with NetworkTopologyStrategy rely on carefully chosen token distribution/path along the ring *and* as tokens are randomly-generated with vnodes, my guess is that with vnodes and NetworkTopologyStrategy it's better to define a single (logical) rack, due to the clash between carefully chosen tokens and randomly generated ones. I don't see other options left. Do you see other ones ? Regards, Dominique -----Original Message----- From: jonathan.had...@gmail.com [mailto:jonathan.had...@gmail.com] On behalf of Jonathan Haddad Sent: Tuesday, August 5, 2014 17:43 To: user@cassandra.apache.org Subject: Re: vnode and NetworkTopologyStrategy: not playing well together ? This is incorrect. Network Topology w/ Vnodes will be fine, assuming you've got RF= # of racks. For each token, replicas are chosen based on the strategy. Essentially, you could have a wild imbalance in token ownership, but it wouldn't matter because the replicas would be distributed across the rest of the machines.
http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html On Tue, Aug 5, 2014 at 8:19 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Hi, My understanding is that NetworkTopologyStrategy does NOT play well with vnodes, due to: · Vnode = tokens are (usually) randomly generated (AFAIK) · NetworkTopologyStrategy = required carefully choosen tokens for all nodes in order to not to get a VERY unbalanced ring like in https://issues.apache.org/jira/browse/CASSANDRA-3810 When playing with vnodes, is the recommendation to define one rack for the entire cluster ? Thanks. Regards, Dominique -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
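Dominique's RF=2 example can be checked with a small simulation of rack-aware placement. This is a simplification of what NetworkTopologyStrategy actually does (walk the ring from each token, preferring racks not yet used), but it reproduces his point: the lone rack_2 node ends up owning a replica of every partition.

```python
from collections import Counter

RACK = {"N1": "rack_1", "N2": "rack_1", "N3": "rack_1", "N4": "rack_2"}
RING = ["N1", "N2", "N3", "N4"]   # nodes in token order
RF = 2

def replicas_for(token_index):
    # Walk the ring clockwise, taking one node per rack not yet used.
    chosen, racks_used = [], set()
    for step in range(len(RING)):
        node = RING[(token_index + step) % len(RING)]
        if RACK[node] not in racks_used:
            chosen.append(node)
            racks_used.add(RACK[node])
        if len(chosen) == RF:
            break
    return chosen

# Count how many token ranges each node holds a replica for.
ownership = Counter(n for t in range(len(RING)) for n in replicas_for(t))
```

With RF equal to the number of racks, the second replica of every range must land in rack_2, so N4 carries 100% of the data, which is Dominique's objection in a nutshell.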
Re: too many open files
It really doesn't need to be this complicated. You only need 1 session per application. It's thread safe and manages the connection pool for you. http://www.datastax.com/drivers/java/2.0/com/datastax/driver/core/Session.html On Sat, Aug 9, 2014 at 1:29 PM, Kevin Burton bur...@spinn3r.com wrote: Another idea to detect this is when the number of open sessions exceeds the number of threads. On Aug 9, 2014 10:59 AM, Andrew redmu...@gmail.com wrote: I just had a generator that (in the incorrect way) had a cluster as a member variable, and would call .connect() repeatedly. I _thought_, incorrectly, that the Session was thread unsafe, and so I should request a separate Session each time—obviously wrong in hindsight. There was no special logic; I had a restriction of about 128 connections per host, but the connections were in the 100s of thousands, like the OP mentioned. Again, I’ll see about reproducing it on Monday, but just wanted the repro steps (overall) to live somewhere in case I can’t. :) Andrew On August 8, 2014 at 4:08:50 PM, Tyler Hobbs (ty...@datastax.com) wrote: On Fri, Aug 8, 2014 at 5:52 PM, Redmumba redmu...@gmail.com wrote: Just to chime in, I also ran into this issue when I was migrating to the Datastax client. Instead of reusing the session, I was opening a new session each time. For some reason, even though I was still closing the session on the client side, I was getting the same error. Which driver? If you can still reproduce this, would you mind opening a ticket? (https://datastax-oss.atlassian.net/secure/BrowseProjects.jspa#all) -- Tyler Hobbs DataStax -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
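The "one Session per application" advice can be sketched as a lazily initialized, lock-guarded holder that all threads share. `connect` below is a stand-in for `cluster.connect()` in a real driver, stubbed out so the example runs without a cluster:

```python
import threading

class SessionHolder:
    # Shared, lazily created session: connect() runs at most once,
    # no matter how many threads call get().
    def __init__(self, connect):
        self._connect = connect
        self._session = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._session is None:
                self._session = self._connect()
            return self._session

calls = []
holder = SessionHolder(lambda: calls.append("connect") or object())
s1, s2 = holder.get(), holder.get()   # same session both times
```

This is the opposite of Andrew's bug: his code called connect() per request, leaking a new connection pool each time until the process hit the file-descriptor limit.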
Re: Table not being created but no error.
Can you provide the code that you use to create the table? This feels like a code error rather than a database bug. On Wed, Aug 13, 2014 at 1:26 PM, Kevin Burton bur...@spinn3r.com wrote: 2.0.5… I'm upgrading to 2.0.9 now just to rule this out…. I can give you the full CQL for the table, but I can't seem to reproduce it without my entire app being included. If I execute the CQL manually, it works… which is what makes this so weird. On Wed, Aug 13, 2014 at 1:11 PM, DuyHai Doan doanduy...@gmail.com wrote: Can you just give the C* version and the complete DDL script to reproduce the issue ? On Wed, Aug 13, 2014 at 10:08 PM, Kevin Burton bur...@spinn3r.com wrote: I'm tracking down a weird bug and was wondering if you guys had any feedback. I'm trying to create ten tables programmatically… The first one I create, for some reason, isn't created. The other 9 are created without a problem. I'm doing this with the datastax driver's session.execute(). No exceptions are thrown. I read the tables back out, and I have 9 of them, but not the first one. I can confirm that the table isn't there because I'm doing a select * from foo0 limit 1 and it gives me an unconfigured column family exception. So it looks like cassandra is just silently not creating the table. This is just in my junit harness for now. So it's one cassandra node, so there shouldn't be an issue with schema disagreement. Kind of stumped here so any suggestion would help. -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com -- Jon Haddad http://www.rustyrazorblade.com skype: rustyrazorblade
Re:
It sounds like your clocks are out of sync. Run ntpdate to fix your clock, then make sure you're running ntpd on every machine. On Mon, Aug 25, 2014 at 1:25 PM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: We're using cassandra 2.0.9 with datastax java cassandra driver 2.0.0 in a cluster of eight nodes. We're doing an insert and then a delete like: delete from column_family_name where id = value Immediately after, we select to check whether the DELETE was successful. Sometimes the value is still there! Any suggestions? -- Atenciosamente, Sávio S. Teles de Oliveira voice: +55 62 9136 6996 http://br.linkedin.com/in/savioteles Mestrando em Ciências da Computação - UFG Arquiteto de Software CUIA Internet Brasil -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re:
This is actually a more correct response than mine, I made a few assumptions that may or may not be true. On Mon, Aug 25, 2014 at 1:31 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Aug 25, 2014 at 1:25 PM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: We're using cassandra 2.0.9 with datastax java cassandra driver 2.0.0 in a cluster of eight nodes. We're doing an insert and after a delete like: delete from column_family_name where id = value Immediatly select to check whether the DELETE was successful. Sometimes the value still there!! What are Replication Factor (RF) and Consistency Level (CL)? =Rob -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re: Failed to enable shuffling error
I believe shuffle has been removed recently. I do not recommend using it for any reason. If you really want to go vnodes, your only sane option is to add a new DC that uses vnodes and switch to it. The downside in the 2.0.x branch to using vnodes is that repairs take N times as long, where N is the number of tokens you put on each node. I can't think of any other reasons why you wouldn't want to use vnodes (but this may be significant enough for you by itself) 2.1 should address the repair issue for most use cases. Jon On Mon, Sep 8, 2014 at 1:28 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Sep 8, 2014 at 1:21 PM, Tim Heckman t...@pagerduty.com wrote: We're still at the exploratory stage on systems that are not production-facing but contain production-like data. Based on our placement strategy we have some concerns that the new datacenter approach may be riskier or more difficult. We're just trying to gauge both paths and see what works best for us. Your case of RF=N is probably the best possible case for shuffle, but general statements about how much this code has been exercised remain. :) The cluster I'm testing this on is a 5 node cluster with a placement strategy such that all nodes contain 100% of the data. In practice we have six clusters of similar size that are used for different services. These different clusters may need additional capacity at different times, so it's hard to answer the maximum size question. For now let's just assume that the clusters may never see an 11th member... but no guarantees. With RF of 3, cluster sizes of under approximately 10 tend to net lose from vnodes. If these clusters are not very likely to ever have more than 10 nodes, consider not using Vnodes. We're looking to use vnodes to help with easing the administrative work of scaling out the cluster. The improvements of streaming data during repairs amongst others. 
Most of these wins don't occur until you have a lot of nodes, but the fixed costs of having many ranges are paid all the time. For shuffle, it looks like it may be easier than adding a new datacenter and then having to adjust the schema for a new datacenter to come to life. And we weren't sure whether the same pitfalls of shuffle would affect us while having all data on all nodes. Let us know! Good luck! =Rob -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re: Failed to enable shuffling error
Thrift is still present in the 2.0 branch as well as 2.1. Where did you see that it's deprecated? Let me elaborate my earlier advice. Shuffle was removed because it doesn't work for anything beyond a trivial dataset. It is definitely more risky than adding a new vnode enabled DC, as it does not work at all. On Mon, Sep 8, 2014 at 2:01 PM, Tim Heckman t...@pagerduty.com wrote: On Mon, Sep 8, 2014 at 1:45 PM, Jonathan Haddad j...@jonhaddad.com wrote: I believe shuffle has been removed recently. I do not recommend using it for any reason. We're still using the 1.2.x branch of Cassandra, and will be for some time due to the thrift deprecation. Has it only been removed from the 2.x line? If you really want to go vnodes, your only sane option is to add a new DC that uses vnodes and switch to it. We use the NetworkTopologyStrategy across three geographically separated regions. Doing it this way feels a bit more risky based on our replication strategy. Also, I'm not sure where all we have our current datacenter names defined across our different internal repositories. So there could be quite a large number of changes going this route. The downside in the 2.0.x branch to using vnodes is that repairs take N times as long, where N is the number of tokens you put on each node. I can't think of any other reasons why you wouldn't want to use vnodes (but this may be significant enough for you by itself) 2.1 should address the repair issue for most use cases. Jon Thank you for the notes on the behaviors in the 2.x branch. If we do move to the 2.x version that's something we'll be keeping in mind. Cheers! -Tim On Mon, Sep 8, 2014 at 1:28 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Sep 8, 2014 at 1:21 PM, Tim Heckman t...@pagerduty.com wrote: We're still at the exploratory stage on systems that are not production-facing but contain production-like data. Based on our placement strategy we have some concerns that the new datacenter approach may be riskier or more difficult. 
We're just trying to gauge both paths and see what works best for us. Your case of RF=N is probably the best possible case for shuffle, but general statements about how much this code has been exercised remain. :) The cluster I'm testing this on is a 5 node cluster with a placement strategy such that all nodes contain 100% of the data. In practice we have six clusters of similar size that are used for different services. These different clusters may need additional capacity at different times, so it's hard to answer the maximum size question. For now let's just assume that the clusters may never see an 11th member... but no guarantees. With RF of 3, cluster sizes of under approximately 10 tend to net lose from vnodes. If these clusters are not very likely to ever have more than 10 nodes, consider not using vnodes. We're looking to use vnodes to help with easing the administrative work of scaling out the cluster. The improvements of streaming data during repairs amongst others. Most of these wins don't occur until you have a lot of nodes, but the fixed costs of having many ranges are paid all the time. For shuffle, it looks like it may be easier than adding a new datacenter and then having to adjust the schema for a new datacenter to come to life. And we weren't sure whether the same pitfalls of shuffle would affect us while having all data on all nodes. Let us know! Good luck! =Rob -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re: multi datacenter replication
Multi-dc is available in every version of Cassandra. On Wed, Sep 10, 2014 at 9:21 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Thank you very much for the links. Just to be sure: is this capability available in the COMMUNITY EDITION? Thanks Oleg. On Wed, Sep 10, 2014 at 11:49 PM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Oleg, Yes, cross-DC replication is something that has been available for a long time already, so it is assumed to be stable. As discussed in this thread, the Cassandra documentation is often outdated or nonexistent; the alternative is the DataStax documentation. http://www.datastax.com/documentation/cassandra/2.0/cassandra/initialize/initializeMultipleDS.html http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_decomission_dc_t.html Hope you'll find everything you need. If some info is missing, come back and ask. Alain 2014-09-10 16:58 GMT+02:00 Oleg Ruchovets oruchov...@gmail.com: Hi All. Is the multi-datacenter replication capability available in the community edition? If yes, can someone share their experience with how stable it is, and where can I read about best practices for it? Thanks Oleg. -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re: Concurrents deletes and updates
Make sure your clocks are synced. If they aren't, the writetime that determines the most recent value will be incorrect. On Wed, Sep 17, 2014 at 11:58 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Sep 17, 2014 at 11:55 AM, Sávio S. Teles de Oliveira savio.te...@cuia.com.br wrote: I'm using Cassandra 2.0.9 with the JAVA datastax driver. I'm running the tests in a cluster with 3 nodes, RF=3 and CL=ALL for each operation. I have a column family filled with some keys (for example 'a' and 'b'). When these keys are deleted and then re-inserted, they sporadically disappear. Is it a bug in Cassandra or in the Datastax driver? Any suggestions? I would file a Cassandra JIRA with reproduction steps. http://issues.apache.org =Rob http://twitter.com/rcolidba -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
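To see why clock sync matters here, a toy model of last-write-wins reconciliation (plain Python standing in for Cassandra's cell merge; the timestamps are invented for illustration):

```python
# Among conflicting versions of a cell, the highest write timestamp wins.
# With skewed clocks, a later delete can carry an *earlier* timestamp and
# silently lose to the insert it was meant to remove.

def reconcile(cells):
    """cells: list of (write_timestamp_micros, value); None = tombstone."""
    return max(cells, key=lambda c: c[0])[1]

insert = (1_400_000_000_000_000, "a")
late_delete = (1_399_999_999_000_000, None)  # issued later, but its clock was behind
assert reconcile([insert, late_delete]) == "a"  # the delete is shadowed
```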
Re: Slow down of secondary index query with VNODE (C* version 1.2.18, jre6).
Keep in mind secondary indexes in cassandra are not there to improve performance, or even really be used in a serious user facing manner. Build and maintain your own view of the data, it'll be much faster. On Thu, Sep 18, 2014 at 6:33 PM, Jay Patel pateljay3...@gmail.com wrote: Hi there, We are seeing extreme slow down (500ms to 1s) in query on secondary index with vnode. I'm seeing multiple secondary index scans on a given node in trace output when vnode is enabled. Without vnode, everything is good. Cluster size: 6 nodes Replication factor: 3 Consistency level: local_quorum. Same behavior happens with consistency level of ONE. Snippet from the trace output. Pls see the attached output1.txt for the full log. Are we hitting any bug? Do not understand why coordinator sends requests multiple times to the same node (e.g. 192.168.51.22 in below output) for different token ranges. Executing indexed scan for [min(-9223372036854775808), max(-9193352069377957523)] | 23:11:30,992 | 192.168.51.22 | Executing indexed scan for (max(-9193352069377957523), max(-9136021049555745100)] | 23:11:30,998 | 192.168.51.25 | Executing indexed scan for (max(-9136021049555745100), max(-8959555493872108621)] | 23:11:30,999 | 192.168.51.22 | Executing indexed scan for (max(-8959555493872108621), max(-8929774302283364912)] | 23:11:31,000 | 192.168.51.25 | Executing indexed scan for (max(-8929774302283364912), max(-8854653908608918942)] | 23:11:31,001 | 192.168.51.22 | Executing indexed scan for (max(-8854653908608918942), max(-8762620856967633953)] | 23:11:31,002 | 192.168.51.25 | Executing indexed scan for (max(-8762620856967633953), max(-8668275030769104047)] | 23:11:31,003 | 192.168.51.22 | Executing indexed scan for (max(-8668275030769104047), max(-8659066486210615614)] | 23:11:31,003 | 192.168.51.25 | Executing indexed scan for (max(-8659066486210615614), max(-8419137646248370231)] | 23:11:31,004 | 192.168.51.22 | Executing indexed scan for (max(-8419137646248370231), 
max(-8416786876632807845)] | 23:11:31,005 | 192.168.51.25 | Executing indexed scan for (max(-8416786876632807845), max(-8315889933848495185)] | 23:11:31,006 | 192.168.51.22 | Executing indexed scan for (max(-8315889933848495185), max(-8270922890152952193)] | 23:11:31,006 | 192.168.51.25 | Executing indexed scan for (max(-8270922890152952193), max(-8260813759533312175)] | 23:11:31,007 | 192.168.51.22 | Executing indexed scan for (max(-8260813759533312175), max(-8234845345932129353)] | 23:11:31,008 | 192.168.51.25 | Executing indexed scan for (max(-8234845345932129353), max(-8216636461332030758)] | 23:11:31,008 | 192.168.51.22 | Thanks, Jay -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
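The "build and maintain your own view" advice above can be sketched with two in-memory dicts standing in for two Cassandra tables (names are illustrative): instead of a secondary index that fans out across token ranges, you write a second table partitioned by the value you want to query.

```python
# Base table plus a hand-maintained lookup table, written together on insert.
# In Cassandra these would be two INSERTs (ideally in a logged batch); the
# lookup table is partitioned by email, so reads hit a single partition
# instead of scanning index shards across the cluster.

users_by_id = {}        # stand-in for table users (partition key: user_id)
user_id_by_email = {}   # stand-in for table users_by_email (partition key: email)

def insert_user(user_id, email, name):
    users_by_id[user_id] = {"email": email, "name": name}
    user_id_by_email[email] = user_id  # maintain the "view" ourselves

def find_by_email(email):
    user_id = user_id_by_email.get(email)
    return users_by_id.get(user_id)

insert_user(1, "jay@example.com", "Jay")
assert find_by_email("jay@example.com")["name"] == "Jay"
```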
Re: Blocking while a node finishes joining the cluster after restart.
Depending on how you query (one or quorum) you might be able to do 1 rack at a time (or az or whatever you've got) assuming your snitch is set up right On Sep 19, 2014, at 11:30 AM, Kevin Burton bur...@spinn3r.com wrote: This is great feedback… I think it could actually be even easier than this… You could have an ansible (or whatever cluster management system you’re using) role for just seeds. Then you would serially restart all seeds one at a time. You would need to run ‘nodetool status’ and make sure the node is ‘U’ (up) I think.. but you might want to make sure the majority of other nodes have agreed that this node is up and available. I think you can ONLY do this serially.. .for a LARGE number of hosts, this might take a while unless you can compute nodes which have mutually exclusive key ranges. The serial approach would take a LONG time for large clusters. If you have sixty nodes, it could take an hour to do a rolling restart. Kevin On Tue, Sep 16, 2014 at 12:21 PM, James Briggs james.bri...@yahoo.com wrote: FYI: OpsCenter has a default of sleep 60 seconds after each node restart, and an option of drain before stopping. I haven't noticed if they do anything special with seeds. (At least one seed needs to be running before you restart other nodes.) I wondered the same thing as Kevin and came to these conclusions. Fixing the startup script is non-trivial as far as startup scripts go. For start, it would have to: - parse cassandra.yaml for seeds - if itself is not a seed, wait for a seed to start first. (could take minutes or never.) - continue start. For a no-downtime cluster restart script, it would have to: - verify cluster health (ie. quorum/CL is met or you lose writes) - parse cassandra.yaml for seeds and see if a seed is up - stop gossip and thrift - maybe do compaction before drain - drain node - stop/start or restart cassandra process. http://comments.gmane.org/gmane.comp.db.cassandra.user/20144 Both of those scripts would be nice to have. 
:) OpsCenter is flaky at doing rolling restart in my test cluster, so an alternative is needed. Also, the free OpsCenter doesn't have rolling repair option enabled. ccm has the options to do drain, stop and start, but a bash script would be needed to make it rolling. https://github.com/pcmanus/ccm Thanks, James. -- Cassandra/MySQL DBA. Available in San Jose area or remote. From: Duncan Sands duncan.sa...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, September 16, 2014 11:09 AM Subject: Re: Blocking while a node finishes joining the cluster after restart. Hi Kevin, if you are using the latest version of opscenter, then even the community (= free) edition can do a rolling restart of your cluster. It's pretty convenient. Ciao, Duncan. On 16/09/14 19:44, Kevin Burton wrote: Say I want to do a rolling restart of Cassandra… I can’t just restart all of them because they need some time to gossip and for that gossip to get to all nodes. What is the best strategy for this. It would be something like: /etc/init.d/cassandra restart wait-for-cassandra.sh … or something along those lines. -- Founder/CEO Spinn3r.com http://Spinn3r.com Location: *San Francisco, CA* blog:**http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com -- Founder/CEO Spinn3r.com Location: San Francisco, CA blog: http://burtonator.wordpress.com … or check out my Google+ profile
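The serial restart loop the thread converges on can be sketched as a dry run (hypothetical: `run` stands in for ssh or your config-management tool, and the final step stands in for polling `nodetool status` until the node reports Up/Normal):

```python
# Serial rolling restart: drain, restart, wait for Up/Normal, then move on.
# Commands are illustrative; substitute your init system and a real poll loop.

def rolling_restart(nodes, run):
    for node in nodes:
        run(node, "nodetool drain")              # flush memtables, stop accepting writes
        run(node, "service cassandra restart")
        run(node, "wait-for-up")                 # e.g. poll `nodetool status` for 'UN'

executed = []
rolling_restart(["node1", "node2"], lambda n, c: executed.append((n, c)))
assert executed[0] == ("node1", "nodetool drain")
assert executed[-1] == ("node2", "wait-for-up")
```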
Re: Difference in retrieving data from cassandra
You'll need to provide a bit of information. To start, a query trace would be helpful. http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/tracing_r.html (self promo) You may want to read over my blog post regarding diagnosing problems in production. I've covered diagnosing slow queries: http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/ On Thu, Sep 25, 2014 at 4:21 AM, Umang Shah shahuma...@gmail.com wrote: Hi All, I am using Cassandra with Pentaho PDI Kettle. I have installed Cassandra in an Amazon EC2 instance and on my local machine. When I retrieve data from the local machine using Pentaho PDI it takes a few seconds (not more than 20 seconds), and if I do the same against the production database it takes almost 3 minutes for the same amount of data, which is a huge difference. Can anybody give me some suggestions on what I need to check, or how I can narrow down this difference? The local machine and the production server have the same RAM. The local machine is a Windows environment and production is Linux. -- Regards, Umang V.Shah BI-ETL Developer -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re: Repair taking long time
Are you using Cassandra 2.0 vnodes? If so, repair takes forever. This problem is addressed in 2.1. On Fri, Sep 26, 2014 at 9:52 AM, Gene Robichaux gene.robich...@match.com wrote: I am fairly new to Cassandra. We have a 9 node cluster, 5 in one DC and 4 in another. Running a repair on a large column family seems to be moving much slower than I expect. Looking at nodetool compaction stats it indicates the Validation phase is running that the total bytes is 4.5T (4505336278756). This is a very large CF. The process has been running for 2.5 hours and has processed 71G (71950433062). That rate is about 28.4 GB per hour. At this rate it will take 158 hours, just shy of 1 week. Is this reasonable? This is my first large repair and I am wondering if this is normal for a CF of this size. Seems like a long time to me. Is it possible to tune this process to speed it up? Is there something in my configuration that could be causing this slow performance? I am running HDDs, not SSDs in a JBOD configuration. Gene Robichaux Manager, Database Operations Match.com 8300 Douglas Avenue I Suite 800 I Dallas, TX 75225 Phone: 214-576-3273 -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re: Repair taking long time
If you're using DSE you might want to contact Datastax support, rather than the ML. On Fri, Sep 26, 2014 at 10:52 AM, Gene Robichaux gene.robich...@match.com wrote: I am on DSE 4.0.3 which is 2.0.7. If 4.5.1 is NOT 2.1. I guess an upgrade will not buy me much….. The bad thing is that table is not our largest….. :( Gene Robichaux Manager, Database Operations Match.com 8300 Douglas Avenue I Suite 800 I Dallas, TX 75225 Phone: 214-576-3273 From: Brice Dutheil [mailto:brice.duth...@gmail.com] Sent: Friday, September 26, 2014 12:47 PM To: user@cassandra.apache.org Subject: Re: Repair taking long time Unfortunately DSE 4.5.0 is still on 2.0.x -- Brice On Fri, Sep 26, 2014 at 7:40 PM, Jonathan Haddad j...@jonhaddad.com wrote: Are you using Cassandra 2.0 vnodes? If so, repair takes forever. This problem is addressed in 2.1. On Fri, Sep 26, 2014 at 9:52 AM, Gene Robichaux gene.robich...@match.com wrote: I am fairly new to Cassandra. We have a 9 node cluster, 5 in one DC and 4 in another. Running a repair on a large column family seems to be moving much slower than I expect. Looking at nodetool compaction stats it indicates the Validation phase is running that the total bytes is 4.5T (4505336278756). This is a very large CF. The process has been running for 2.5 hours and has processed 71G (71950433062). That rate is about 28.4 GB per hour. At this rate it will take 158 hours, just shy of 1 week. Is this reasonable? This is my first large repair and I am wondering if this is normal for a CF of this size. Seems like a long time to me. Is it possible to tune this process to speed it up? Is there something in my configuration that could be causing this slow performance? I am running HDDs, not SSDs in a JBOD configuration. 
Gene Robichaux Manager, Database Operations Match.com 8300 Douglas Avenue I Suite 800 I Dallas, TX 75225 Phone: 214-576-3273 -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
Re: Repair taking long time
Well, in that case, you may want to roll your own script for doing constant repairs of your cluster, and extend your gc grace seconds so you can repair the whole cluster before the tombstones are cleared. On Fri, Sep 26, 2014 at 11:15 AM, Gene Robichaux gene.robich...@match.com wrote: Using their community edition..no support (yet!) :( Gene Robichaux Manager, Database Operations Match.com 8300 Douglas Avenue I Suite 800 I Dallas, TX 75225 Phone: 214-576-3273 -Original Message- From: jonathan.had...@gmail.com [mailto:jonathan.had...@gmail.com] On Behalf Of Jonathan Haddad Sent: Friday, September 26, 2014 12:58 PM To: user@cassandra.apache.org Subject: Re: Repair taking long time If you're using DSE you might want to contact Datastax support, rather than the ML. On Fri, Sep 26, 2014 at 10:52 AM, Gene Robichaux gene.robich...@match.com wrote: I am on DSE 4.0.3 which is 2.0.7. If 4.5.1 is NOT 2.1. I guess an upgrade will not buy me much….. The bad thing is that table is not our largest….. :( Gene Robichaux Manager, Database Operations Match.com 8300 Douglas Avenue I Suite 800 I Dallas, TX 75225 Phone: 214-576-3273 From: Brice Dutheil [mailto:brice.duth...@gmail.com] Sent: Friday, September 26, 2014 12:47 PM To: user@cassandra.apache.org Subject: Re: Repair taking long time Unfortunately DSE 4.5.0 is still on 2.0.x -- Brice On Fri, Sep 26, 2014 at 7:40 PM, Jonathan Haddad j...@jonhaddad.com wrote: Are you using Cassandra 2.0 vnodes? If so, repair takes forever. This problem is addressed in 2.1. On Fri, Sep 26, 2014 at 9:52 AM, Gene Robichaux gene.robich...@match.com wrote: I am fairly new to Cassandra. We have a 9 node cluster, 5 in one DC and 4 in another. Running a repair on a large column family seems to be moving much slower than I expect. Looking at nodetool compaction stats it indicates the Validation phase is running that the total bytes is 4.5T (4505336278756). This is a very large CF. 
The process has been running for 2.5 hours and has processed 71G (71950433062). That rate is about 28.4 GB per hour. At this rate it will take 158 hours, just shy of 1 week. Is this reasonable? This is my first large repair and I am wondering if this is normal for a CF of this size. Seems like a long time to me. Is it possible to tune this process to speed it up? Is there something in my configuration that could be causing this slow performance? I am running HDDs, not SSDs in a JBOD configuration. Gene Robichaux Manager, Database Operations Match.com 8300 Douglas Avenue I Suite 800 I Dallas, TX 75225 Phone: 214-576-3273 -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
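The "roll your own constant repair" idea above might look like the following sketch (assumptions: one node repaired at a time with `nodetool repair -pr`, a `run` placeholder for remote execution, and a rough per-node duration check against gc_grace_seconds):

```python
# A full repair pass must finish inside gc_grace_seconds, or tombstones can
# be collected before every replica has seen them. The duration figures here
# are invented; measure your own.

GC_GRACE_SECONDS = 10 * 24 * 3600          # raise this if a pass takes longer

def repair_pass(nodes, run, est_secs_per_node=6 * 3600):
    assert len(nodes) * est_secs_per_node < GC_GRACE_SECONDS, \
        "extend gc_grace_seconds: a full pass won't beat tombstone GC"
    for node in nodes:
        run(node, "nodetool repair -pr")   # -pr: repair only the primary range

calls = []
repair_pass(["n1", "n2", "n3"], lambda n, c: calls.append(n))
assert calls == ["n1", "n2", "n3"]
```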
Re: Performance Issue: Keeping rows in memory
First, did you run a query trace? I recommend Al Tobey's pcstat util to determine if your files are in the buffer cache: https://github.com/tobert/pcstat On Wed, Oct 22, 2014 at 4:34 AM, Thomas Whiteway thomas.white...@metaswitch.com wrote: Hi, I'm working on an application using a Cassandra (2.1.0) cluster where - our entire dataset is around 22GB - each node has 48GB of memory but only a single (mechanical) hard disk - in normal operation we have a low level of writes and no reads - very occasionally we need to read rows very fast (1.5K rows/second), and only read each row once. When we try and read the rows it takes up to five minutes before Cassandra is able to keep up. The problem seems to be that it takes a while to get the data into the page cache, and until then Cassandra can't retrieve the data from disk fast enough (e.g. if I drop the page cache mid-test then Cassandra slows down for the next 5 minutes). Given that the total amount of data should fit comfortably in memory, I've been trying to find a way to keep the rows cached in memory, but there doesn't seem to be a particularly great way to achieve this. I've tried enabling the row cache and pre-populating the test by querying every row before starting the load, which gives good performance, but the row cache isn't really intended to be used this way and we'd be fighting the row cache to keep the rows in (e.g. by cyclically reading through all the rows during normal operation). Keeping the page cache warm by running a background task to keep accessing the files for the sstables would be simpler, and currently this is the solution we're leaning towards, but we have less control over the page cache, it would be vulnerable to other processes knocking Cassandra's files out, and it generally feels like a bit of a hack. Has anyone had any success with trying to do something similar to this, or have any suggestions for possible solutions?
Thanks, Thomas -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
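For reference, the background page-cache warmer Thomas describes could be as simple as the sketch below (the `*Data.db` layout is an assumption about the data directory; as he says, it's a hack, since the kernel is still free to evict the pages):

```python
# Sequentially read every sstable data file so the kernel keeps it resident
# in the page cache. Run periodically from cron or a background thread.

import glob
import os

def warm(data_dir, chunk=1 << 20):
    total = 0
    for path in glob.glob(os.path.join(data_dir, "**", "*Data.db"), recursive=True):
        with open(path, "rb") as f:
            while f.read(chunk):   # each read pulls pages into the cache
                pass
        total += os.path.getsize(path)
    return total                   # bytes touched in this pass
```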
Re: Is cassandra smart enough to serve Read requests entirely from Memtables in some cases?
No. Consider a scenario where you supply a timestamp a week in the future, flush it to an sstable, and then do a write with the current timestamp. The record on disk will have a timestamp greater than the one in the memtable. On Wed, Oct 22, 2014 at 9:18 AM, Donald Smith donald.sm...@audiencescience.com wrote: Question about the read path in cassandra. If a partition/row is in the Memtable and is being actively written to by other clients, will a READ of that partition also have to hit SStables on disk (or in the page cache)? Or can it be serviced entirely from the Memtable? If you select all columns (e.g., "select * from …") then I can imagine that cassandra would need to merge whatever columns are in the Memtable with what's in SStables on disk. But if you select a single column (e.g., "select Name from … where id = …") and if that column is in the Memtable, I'd hope cassandra could skip checking the disk. Can it do this optimization? Thanks, Don *Donald A. Smith* | Senior Software Engineer P: 425.201.3900 x 3866 C: (206) 819-5965 F: (646) 443-2333 dona...@audiencescience.com -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
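A toy model of the scenario Jon describes (invented timestamps, dicts standing in for the memtable and an sstable): the post-dated cell on disk outranks the newer physical write, so a correct read has to merge both sources rather than trust the memtable.

```python
WEEK_US = 7 * 24 * 3600 * 1_000_000
NOW = 1_700_000_000_000_000

sstable = {"name": (NOW + WEEK_US, "future")}   # flushed write, stamped a week ahead
memtable = {"name": (NOW, "current")}           # most recent physical write

def read(key):
    candidates = [c for c in (memtable.get(key), sstable.get(key)) if c]
    return max(candidates)[1]   # highest timestamp wins, whatever its source

assert read("name") == "future"   # serving from the memtable alone would be wrong
```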
Re: OOM at Bootstrap Time
If the issue is related to I/O, you're going to want to determine if you're saturated. Take a look at `iostat -dmx 1`; you'll see avgqu-sz (queue size) and svctm (service time). The higher those numbers are, the more overwhelmed your disk is. On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan doanduy...@gmail.com wrote: Hello Maxime Increasing the flush writers won't help if your disk I/O is not keeping up. I've had a look at the log file; below are some remarks: 1) There are a lot of SSTables on disk for some tables (events for example, but not only). I've seen that some compactions are taking up to 32 SSTables (which corresponds to the default max value for SizeTiered compaction). 2) There is a secondary index that I found suspicious: loc.loc_id_idx. As its name implies, I have the impression that it's an index on the id of the loc, which would lead to almost a 1-1 relationship between the indexed value and the original loc. Such indexes should be avoided because they do not perform well. If it's not an index on the loc_id, please disregard my remark 3) There is a clear imbalance of SSTable count on some nodes. In the log, I saw: INFO [STREAM-IN-/...20] 2014-10-25 02:21:43,360 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes), sending 0 files(0 bytes) INFO [STREAM-IN-/...81] 2014-10-25 02:21:46,121 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes), sending 0 files(0 bytes) INFO [STREAM-IN-/...71] 2014-10-25 02:21:50,494 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes), sending 0 files(0 bytes) INFO [STREAM-IN-/...217] 2014-10-25 02:21:51,036 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed.
Receiving 1640 files(3 208 023 573 bytes), sending 0 files(0 bytes) As you can see, the existing 4 nodes are streaming data to the new node and on average the data set size is about 3.3 - 4.5 Gb. However the number of SSTables is around 150 files for nodes ...20 and ...81 but goes through the roof to reach 1315 files for ...71 and 1640 files for ...217 The total data set size is roughly the same but the file count is 10x, which means that you'll have a bunch of tiny files. I guess that upon reception of those files, there will be a massive flush to disk, explaining the behaviour you're facing (flush storm). I would suggest looking at nodes ...71 and ...217 to check the total SSTable count for each table to confirm this intuition. Regards On Sun, Oct 26, 2014 at 4:58 PM, Maxime maxim...@gmail.com wrote: I've emailed you a raw log file of an instance of this happening. I've been monitoring more closely the timing of events in tpstats and the logs and I believe this is what is happening: - For some reason, C* decides to provoke a flush storm (I say some reason, I'm sure there is one but I have had difficulty determining the behaviour changes between 1.* and more recent releases). - So we see ~3000 flushes being enqueued. - This happens so suddenly that even boosting the number of flush writers to 20 does not suffice. I don't even see all time blocked numbers for it before C* stops responding. I suspect this is due to the sudden OOM and GC occurring. - The last tpstat that comes back before the node goes down indicates 20 active and 3000 pending and the rest 0. It's by far the most anomalous activity. Is there a way to throttle down this generation of Flush? C* complains if I set the queue_size to any value (deprecated now?) and boosting the threads does not seem to help since even at 20 we're an order of magnitude off. Suggestions? Comments?
On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan doanduy...@gmail.com wrote: Hello Maxime Can you put the complete logs and config somewhere ? It would be interesting to know what is the cause of the OOM. On Sun, Oct 26, 2014 at 3:15 AM, Maxime maxim...@gmail.com wrote: Thanks a lot that is comforting. We are also small at the moment so I definitely can relate with the idea of keeping small and simple at a level where it just works. I see the new Apache version has a lot of fixes so I will try to upgrade before I look into downgrading. On Saturday, October 25, 2014, Laing, Michael michael.la...@nytimes.com wrote: Since no one else has stepped in... We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes each with 1 CPU and disk-based instance storage. It works fine but you can see those little puppies struggle... And I ran into problems such as you observe... Upgrading Java to the
Re: read after write inconsistent even on a one node cluster
For cqlengine we do quite a bit of write then read to ensure data was written correctly, across 1.2, 2.0, and 2.1. For what it's worth, I've never seen this issue come up. On a single node, Cassandra only acks the write after it's been written into the memtable. So, you'd expect to see the most recent data. A possibility - if you're running in a VM, it's possible the clock isn't incrementing in real time? I've seen this happen with uuid1 generation - I was getting duplicates if I generated them fast enough. Perhaps you're writing 2 values one right after the other and they're getting the same millisecond precision timestamp. On Thu, Nov 6, 2014 at 10:26 AM, Robert Coli rc...@eventbrite.com wrote: On Thu, Nov 6, 2014 at 6:14 AM, Brian Tarbox briantar...@gmail.com wrote: We write values to our keyspaces and then immediately read the values back (in our Cucumber tests). About 20% of the time we get the old value.if we wait 1 second and redo the query (within the same java method) we get the new value. This is all happening on a single node...how is this possible? It sounds unreasonable/unexpected to me, if you have a trivial repro case, I would file a JIRA. =Rob -- Jon Haddad http://www.rustyrazorblade.com twitter: rustyrazorblade
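Jon's uuid1 anecdote is easy to reproduce with the standard library: `uuid.uuid1()` bumps its internal timestamp when the clock hasn't advanced, so it stays unique in-process, while raw millisecond timestamps generated in a tight loop collide freely.

```python
import time
import uuid

# uuid1 guards against a non-advancing clock, so values never repeat in-process:
ids = [uuid.uuid1() for _ in range(1000)]
assert len(set(ids)) == len(ids)

# Millisecond-precision write timestamps have no such guard and collide easily:
stamps = [int(time.time() * 1000) for _ in range(1000)]
assert len(set(stamps)) < len(stamps)   # many writes share the same millisecond
```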
Re: query tracing
Personally I've found that using query timing + log aggregation on the client side is more effective than trying to mess with tracing probability in order to find a single query which has recently become a problem. I recommend wrapping your session with something that can automatically log the statement on a slow query, then use tracing to identify exactly what happened. This way finding your problem is not a matter of chance. On Fri Nov 07 2014 at 9:41:38 AM Chris Lohfink clohfin...@gmail.com wrote: It saves a lot of information for each request that's traced, so there is significant overhead. If you start at a low probability and move it up based on the load impact, it will provide a lot of insight and you can control the cost. --- Chris Lohfink On Fri, Nov 7, 2014 at 11:35 AM, Jimmy Lin y2klyf+w...@gmail.com wrote: Is there any significant performance penalty if one turns on Cassandra query tracing through the DataStax Java driver (say, for every request of some troublesome query)? More sampling seems better, but doing so may also slow down the system in some other ways? thanks
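A sketch of the session wrapper Jon suggests (the delegate here is a stand-in; with the DataStax driver you would forward to `Session.execute`, and the threshold and logging choices are up to you):

```python
import time

class SlowQueryLogger:
    """Times every statement and logs the ones exceeding a threshold,
    so tracing can then be turned on for just those statements."""

    def __init__(self, execute, threshold_s=0.5, log=print):
        self._execute = execute
        self._threshold_s = threshold_s
        self._log = log

    def execute(self, statement, *args):
        start = time.perf_counter()
        try:
            return self._execute(statement, *args)
        finally:
            elapsed = time.perf_counter() - start
            if elapsed >= self._threshold_s:
                self._log("slow query (%.3fs): %s" % (elapsed, statement))

logged = []
session = SlowQueryLogger(lambda stmt: time.sleep(0.02),
                          threshold_s=0.01, log=logged.append)
session.execute("SELECT * FROM big_table WHERE id = 1")
assert logged and logged[0].startswith("slow query")
```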
Re: PHP - Cassandra integration
In production? On Mon Nov 10 2014 at 6:06:41 AM Spencer Brown lilspe...@gmail.com wrote: I'm using /McFrazier/PhpBinaryCql/ On Mon, Nov 10, 2014 at 1:48 AM, Akshay Ballarpure akshay.ballarp...@tcs.com wrote: Hello, I am working on PHP cassandra integration, please let me know which library is good from scalability and performance perspective ? Best Regards Akshay Ballarpure Tata Consultancy Services Cell:- 9985084075 Mailto: akshay.ballarp...@tcs.com Website: http://www.tcs.com Experience certainty.IT Services Business Solutions Consulting =-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you
Re: Cassandra sort using updatable query
With Cassandra you're going to want to model tables to meet the requirements of your queries instead of like a relational database where you build tables in 3NF then optimize after. For your optimized select query, your table (with caveat, see below) could start out as: create table words ( year int, frequency int, content text, primary key (year, frequency, content) ); You may want to maintain other tables as well for different types of select statements. Your UPDATE statement above won't work, you'll have to DELETE and INSERT, since you can't change the value of a clustering column. If you don't know what your old frequency is ahead of time (to do the delete), you'll need to keep another table mapping content,year - frequency. Now, the tricky part here is that the above model will limit the total number of partitions you've got to the number of years you're working with, and will not scale as the cluster increases in size. Ideally you could bucket frequencies. If that feels like too much work (it's starting to for me), this may be better suited to something like solr, elastic search, or DSE (cassandra + solr). Does that help? Jon On Wed Nov 12 2014 at 9:01:44 AM Chamila Wijayarathna cdwijayarat...@gmail.com wrote: Hello all, I have a data set with attributes content and year. I want to put them in to CF 'words' with attributes ('content','year','frequency'). The CF should support following operations. - Frequency attribute of a column can be updated (i.e. - : can run query like UPDATE words SET frequency = 2 WHERE content='abc' AND year=1990;), where clause should contain content and year - Should support select query like Select content from words where year = 2010 ORDER BY frequency DESC LIMIT 10; (where clause only has year) where results can be ordered using frequency Is this kind of requirement can be fulfilled using Cassandra? What is the CF structure and indexing I need to use here? What queries should I use to create CF and in indexing? Thank You! 
-- *Chamila Dilshan Wijayarathna,* SMIEEE, SMIESL, Undergraduate, Department of Computer Science and Engineering, University of Moratuwa.
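Jon's suggestion to "bucket frequencies" can be sketched in a few lines. This is only an illustration, not anything from the thread: the bucket (here, simply the frequency's order of magnitude) would become part of the partition key alongside year, so the number of partitions is no longer capped at the number of years. The function name and bucketing scheme are made up for the example.

```python
import math

def frequency_bucket(frequency: int) -> int:
    """Coarse bucket for a word frequency: its order of magnitude.
    Using (year, bucket) as the partition key spreads rows across
    more partitions than one-per-year."""
    return 0 if frequency <= 0 else int(math.log10(frequency))

print(frequency_bucket(2))     # 0
print(frequency_bucket(1500))  # 3
```

The trade-off is that a top-N-by-frequency query must now read the highest buckets first and merge results client-side, so this only pays off once the one-partition-per-year layout would grow too wide.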
Re: Is it more performant to split data with the same schema into multiple keyspaces, as opposed to putting all of them into the same keyspace?
Performance will be the same. There's no performance benefit to using multiple keyspaces. On Thu Nov 13 2014 at 8:42:40 AM Li, George guangxing...@pearson.com wrote: Hi, we use Cassandra to store some association type of data. For example, store user to course (course registrations) association and user to school (school enrollment) association data. The schema for these two types of associations are the same. So there are two options to store the data: 1. Put user to course association data into one keyspace, and user to school association data into another keyspace. 2. Put both of them into the same keyspace. In the long run, such data will grow to be very large. With that in mind, is it better to use the first approach (having multiple keyspaces) for better performance? Thanks. George
Re: Is it more performant to split data with the same schema into multiple keyspaces, as opposed to putting all of them into the same keyspace?
Tables, yes, but that wasn't the question. The question was around using different keyspaces. On Thu Nov 13 2014 at 9:17:30 AM Tyler Hobbs ty...@datastax.com wrote: That's not necessarily true. You don't need to split them into separate keyspaces, but separate tables may have some advantages. For example, in Cassandra 2.1, compaction and index summary management are optimized based on read rates for SSTables. If you have different read rates or patterns for the two types of data, it will confuse/eliminate these optimizations. If you have two separate sets of data with (potentially) two separate read patterns, don't put them in the same table. On Thu, Nov 13, 2014 at 11:08 AM, Jonathan Haddad j...@jonhaddad.com wrote: Performance will be the same. There's no performance benefit to using multiple keyspaces. On Thu Nov 13 2014 at 8:42:40 AM Li, George guangxing...@pearson.com wrote: Hi, we use Cassandra to store some association type of data. For example, store user to course (course registrations) association and user to school (school enrollment) association data. The schema for these two types of associations are the same. So there are two options to store the data: 1. Put user to course association data into one keyspace, and user to school association data into another keyspace. 2. Put both of them into the same keyspace. In the long run, such data will grow to be very large. With that in mind, is it better to use the first approach (having multiple keyspaces) for better performance? Thanks. George -- Tyler Hobbs DataStax http://datastax.com/
Re: Deduplicating data on a node (RF=1)
If he deletes all the data with RF=1, won't he have data loss? On Mon Nov 17 2014 at 5:14:23 PM Michael Shuler mich...@pbandjelly.org wrote: On 11/17/2014 02:04 PM, Alain Vandendorpe wrote: Hey all, For legacy reasons we're living with Cassandra 2.0.10 in an RF=1 setup. This is being moved away from ASAP. In the meantime, adding a node recently encountered a Stream Failed error (http://pastie.org/9725846). Cassandra restarted and it seemingly restarted streaming from zero, without having removed the failed stream's data. With bootstrapping and initial compactions finished that node now has what seems to be duplicate data, with almost exactly 2x the expected disk usage. CQL returns correct results but we depend on the ability to directly read the SSTable files (hence also RF=1.) Would anyone have suggestions on a good way to resolve this? Start over fresh, deleting *all* the data, and bootstrap the node again? -- Michael
Re: Using Cassandra for session tokens
I don't think DateTiered will help here, since there's no clustering key defined. This is a pretty straightforward workload; I've done something similar. Are you overwriting the session on every request? Or just writing it once?

On Mon Dec 01 2014 at 6:45:14 AM Matt Brown m...@mattnworb.com wrote: This sounds like a good use case for http://www.datastax.com/dev/blog/datetieredcompactionstrategy

On Dec 1, 2014, at 3:07 AM, Phil Wise p...@advancedtelematic.com wrote: We're considering switching from Redis to Cassandra to store short-lived (~1 hour) session tokens, in order to reduce the number of data storage engines we have to manage. Can anyone foresee any problems with the following approach:

1) Use the TTL functionality in Cassandra to remove old tokens.

2) Store the tokens in a table like:

CREATE TABLE tokens (
    id uuid,
    username text,
    // (other session information)
    PRIMARY KEY (id)
);

3) Perform ~100 writes/sec like:

INSERT INTO tokens (id, username) VALUES (468e0d69-1ebe-4477-8565-00a4cb6fa9f2, 'bob') USING TTL 3600;

4) Perform ~1000 reads/sec like:

SELECT * FROM tokens WHERE id=468e0d69-1ebe-4477-8565-00a4cb6fa9f2;

The tokens will be about 100 bytes each, and we will grant 100 per second on a small 3-node cluster. Therefore there will be about 360k tokens alive at any time, with a total size of 36MB before database overhead.

My biggest worry at the moment is that this kind of workload will stress compaction in an unusual way. Are there any metrics I should keep an eye on to make sure it is working fine? I read over the following links, but they mostly talk about DELETE-ing and tombstones. Am I right in thinking that as soon as a node performs a compaction, the rows with an expired TTL will be thrown away, regardless of gc_grace_seconds?

https://issues.apache.org/jira/browse/CASSANDRA-7534
http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
https://issues.apache.org/jira/browse/CASSANDRA-6654

Thank you Phil
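Phil's capacity estimate checks out with simple steady-state arithmetic: with a fixed TTL, the live population is the write rate times the token lifetime. A quick sanity check using the numbers from the message above:

```python
writes_per_sec = 100   # token grants per second
ttl_seconds = 3600     # 1 hour TTL
token_bytes = 100      # approximate size per token

# At steady state, tokens alive = arrival rate x lifetime.
live_tokens = writes_per_sec * ttl_seconds
raw_size_mb = live_tokens * token_bytes / 1_000_000  # before storage overhead

print(live_tokens)  # 360000
print(raw_size_mb)  # 36.0
```

This matches the 360k tokens / 36MB figure in the message; actual on-disk size will be larger due to SSTable overhead and not-yet-compacted expired rows.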
Re: Using Cassandra for session tokens
I don't know what the advantage would be of using this sharding system. I would recommend just going with a simple k-v table as the OP suggested. On Mon Dec 01 2014 at 7:18:51 AM Laing, Michael michael.la...@nytimes.com wrote: Since the session tokens are random, perhaps computing a shard from each one and using it as the partition key would be a good idea. I would also use uuid v1 to get ordering. With such a small amount of data, only a few shards would be needed. On Mon, Dec 1, 2014 at 10:08 AM, Phil Wise p...@advancedtelematic.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 The session will be written once at create time, and never modified after that. Will that affect things? Thank you - -Phil On 01.12.2014 15:58, Jonathan Haddad wrote: I don't think DateTiered will help here, since there's no clustering key defined. This is a pretty straightforward workload, I've done something similar. Are you overwriting the session on every request? Or just writing it once? On Mon Dec 01 2014 at 6:45:14 AM Matt Brown m...@mattnworb.com wrote: This sounds like a good use case for http://www.datastax.com/dev/blog/datetieredcompactionstrategy On Dec 1, 2014, at 3:07 AM, Phil Wise p...@advancedtelematic.com wrote: We're considering switching from using Redis to Cassandra to store short lived (~1 hour) session tokens, in order to reduce the number of data storage engines we have to manage. Can anyone foresee any problems with the following approach: 1) Use the TTL functionality in Cassandra to remove old tokens. 2) Store the tokens in a table like: CREATE TABLE tokens ( id uuid, username text, // (other session information) PRIMARY KEY (id) ); 3) Perform ~100 writes/sec like: INSERT INTO tokens (id, username ) VALUES (468e0d69-1ebe-4477-8565-00a4cb6fa9f2, 'bob') USING TTL 3600; 4) Perform ~1000 reads/sec like: SELECT * FROM tokens WHERE ID=468e0d69-1ebe-4477-8565-00a4cb6fa9f2 ; The tokens will be about 100 bytes each, and we will grant 100 per second on a small 3 node cluster. 
Therefore there will be about 360k tokens alive at any time, with a total size of 36MB before database overhead. My biggest worry at the moment is that this kind of workload will stress compaction in an unusual way. Are there any metrics I should keep an eye on to make sure it is working fine? I read over the following links, but they mostly talk about DELETE-ing and tombstones. Am I right in thinking that as soon as a node performs a compaction then the rows with an expired TTL will be thrown away, regardless of gc_grace_seconds? https://issues.apache.org/jira/browse/CASSANDRA-7534 http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets https://issues.apache.org/jira/browse/CASSANDRA-6654 Thank you Phil
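Michael's sharding suggestion is easy to sketch: since the session token is random, a shard derived from the token itself is uniformly distributed, and reads can recompute it without any lookup. A hypothetical sketch (the shard count and function name are my own, not from the thread):

```python
import uuid

NUM_SHARDS = 16  # a handful of shards is plenty for ~36 MB of data

def shard_for(token: uuid.UUID) -> int:
    """Stable shard derived from the token itself; with the shard as a
    partition key component, reads recompute it from the token they hold."""
    return token.int % NUM_SHARDS

token = uuid.UUID('468e0d69-1ebe-4477-8565-00a4cb6fa9f2')
print(shard_for(token))  # 2 (the low 4 bits of ...f2)
```

As Jon notes, for a plain key-value lookup table this buys little; it matters more when you want time-ordered clustering (uuid v1) within a bounded set of partitions.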
Re: full gc too often
I recommend reading through https://issues.apache.org/jira/browse/CASSANDRA-8150 to get an idea of how the JVM GC works and what you can do to tune it. Also good is Blake Eggleston's writeup, which can be found here: http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html I'd like to note that allocating a 4GB heap to Cassandra under any serious workload is unlikely to be sufficient.

On Thu Dec 04 2014 at 8:43:38 PM Philo Yang ud1...@gmail.com wrote: I have two kinds of machine: 16G RAM, with the default heap size setting, about 4G; and 64G RAM, with the default heap size setting, about 8G. These two kinds of nodes have the same number of vnodes, and both of them have the GC issue, although the 16G nodes have a higher probability of it. Thanks, Philo Yang

2014-12-05 12:34 GMT+08:00 Tim Heckman t...@pagerduty.com: On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote: Hi all, I have a cluster on C* 2.1.1 and JDK 1.7_u51. I have trouble with full GC: sometimes one or two nodes run a full GC more than once per minute, taking over 10 seconds each time; the node then becomes unreachable and the latency of the cluster increases. Grepping the GCInspector's log, I found that when a node is running fine without GC trouble there are two kinds of GC: ParNew GC in less than 300ms, which clears the Par Eden Space and enlarges CMS Old Gen / Par Survivor Space a little (since the log only shows GCs over 200ms, there is only a small number of ParNew GCs in it), and ConcurrentMarkSweep in 4000~8000ms, which reduces CMS Old Gen a lot and enlarges Par Eden Space a little, running once every 1-2 hours. However, sometimes ConcurrentMarkSweep gets strange, as shown here:

INFO [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
INFO [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
INFO [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms. CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms. CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms. CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
INFO [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms. CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms. CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
INFO [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms. CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms. CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
INFO [Service Thread] 2014-12-05 11:30:54,085 GCInspector.java:142 - ConcurrentMarkSweep GC in 12082ms. CMS Old Gen: 3579786592 -> 3579814464; Par Eden Space: 503316480 -> 448782440; Par Survivor Space: 62914560 -> 0

Each time, Old Gen shrinks only a little and Survivor Space is cleared, but the heap is still full, so another full GC follows very soon and then the node goes down. If I restart the node, it runs fine without GC trouble. Can anyone help me find out why full GC can't reduce CMS Old Gen? Is it because there are too many objects in the heap that can't be recycled? I think reviewing the table schema design and adding new nodes to the cluster is a good idea, but I still want to know if there is any other reason causing this trouble.

How much total system memory do you have? How much is allocated for heap usage? How big is your working data set? The reason I ask is that I've seen problems with lots of GC with no room gained, and it was memory pressure. Not enough for the heap. We decided that just increasing the heap size was a bad idea, as we did rely on free RAM being used for filesystem caching. So some vertical and horizontal scaling allowed us to give Cass more heap space, as well as distribute
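One quick way to see how little each CMS cycle reclaims is to parse the GCInspector lines and subtract the after-value from the before-value for Old Gen. A rough, hypothetical parser (two sample lines from the logs above, arrows normalized to "->"):

```python
import re

LOG = """\
ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464
ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 -> 3579836512
"""

PATTERN = re.compile(r"(\w+) GC in (\d+)ms\. CMS Old Gen: (\d+) -> (\d+)")

for match in PATTERN.finditer(LOG):
    collector, ms, before, after = match.groups()
    reclaimed = int(before) - int(after)
    # A 12-second pause that reclaims ~0 bytes (or a negative number, i.e.
    # the old gen actually grew) is the signature of a heap that is
    # effectively full of live objects.
    print(f"{collector}: {ms}ms pause, reclaimed {reclaimed} bytes")
```

Running this over the full log would show every one of the 10+ second pauses above reclaiming almost nothing, which is exactly the "too many live objects for the heap" symptom Tim describes.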
Re: Could ring cache really improve performance in Cassandra?
What's a ring cache? FYI, if you're using the DataStax CQL drivers, they will automatically route requests to the correct node. On Sun Dec 07 2014 at 12:59:36 AM kong kongjiali...@gmail.com wrote: Hi, I'm doing a stress test on Cassandra. And I learned that using a ring cache can improve performance because the client requests can go directly to the target Cassandra server, so the coordinator node is the desired target node. In this way there is no need for the coordinator node to route the client requests to the target node, and maybe we can get a linear performance increment. However, in my stress test on an Amazon EC2 cluster, the test results are weird. It seems that there's no performance improvement after using the ring cache. Could anyone help me explain these results? (Also, I think the results of the test without the ring cache are weird, because there's no linear increment in QPS when new nodes are added. I need help explaining this, too.) The results are as follows:

INSERT (write):

Node count  Replication factor  QPS (no ring cache)  QPS (ring cache)
1           1                   18687                20195
2           1                   20793                26403
2           2                   22498                21263
4           1                   28348                30010
4           3                   28631                24413

SELECT (read):

Node count  Replication factor  QPS (no ring cache)  QPS (ring cache)
1           1                   24498                22802
2           1                   28219                27030
2           2                   35383                36674
4           1                   34648                28347
4           3                   52932                52590

Thank you very much, Joy
Re: full gc too often
There are a lot of factors that go into tuning, and I don't know of any reliable formula you can use to figure out what will work optimally for your hardware. Personally I recommend: 1) find the bottleneck, 2) play with a parameter (or two), 3) see what changed, performance-wise. If you've got a specific question I think someone can find a way to help, but asking "what can 8gb of heap give me" is pretty abstract and unanswerable. Jon

On Sun Dec 07 2014 at 8:03:53 AM Philo Yang ud1...@gmail.com wrote: 2014-12-05 15:40 GMT+08:00 Jonathan Haddad j...@jonhaddad.com: I recommend reading through https://issues.apache.org/jira/browse/CASSANDRA-8150 to get an idea of how the JVM GC works and what you can do to tune it. Also good is Blake Eggleston's writeup which can be found here: http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html I'd like to note that allocating a 4GB heap to Cassandra under any serious workload is unlikely to be sufficient.

Thanks for your recommendation. After reading, I tried allocating a larger heap and it helped. A 4G heap can't handle the workload in my use case indeed. So another question is: how much pressure can the default max heap (8G) handle? The pressure may not be a simple QPS number; as you know, a slice query over many columns in a row will allocate more objects in the heap than a query for a single column. Is there any testing result on the relationship between pressure and a safe heap size? We know that querying a slice with many tombstones is not a good use case, but querying a slice without tombstones may be a common use case, right?

On Thu Dec 04 2014 at 8:43:38 PM Philo Yang ud1...@gmail.com wrote: I have two kinds of machine: 16G RAM, with the default heap size setting, about 4G; and 64G RAM, with the default heap size setting, about 8G. These two kinds of nodes have the same number of vnodes, and both of them have the GC issue, although the 16G nodes have a higher probability of it.
Thanks, Philo Yang 2014-12-05 12:34 GMT+08:00 Tim Heckman t...@pagerduty.com: On Dec 4, 2014 8:14 PM, Philo Yang ud1...@gmail.com wrote: Hi,all I have a cluster on C* 2.1.1 and jdk 1.7_u51. I have a trouble with full gc that sometime there may be one or two nodes full gc more than one time per minute and over 10 seconds each time, then the node will be unreachable and the latency of cluster will be increased. I grep the GCInspector's log, I found when the node is running fine without gc trouble there are two kinds of gc: ParNew GC in less than 300ms which clear the Par Eden Space and enlarge CMS Old Gen/ Par Survivor Space little (because it only show gc in more than 200ms, there is only a small number of ParNew GC in log) ConcurrentMarkSweep in 4000~8000ms which reduce CMS Old Gen much and enlarge Par Eden Space little, each 1-2 hours it will be executed once. However, sometimes ConcurrentMarkSweep will be strange like it shows: INFO [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 - 3579838464; Par Eden Space: 503316480 - 294794576; Par Survivor Space: 62914528 - 0 INFO [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 - 3579836512; Par Eden Space: 503316480 - 310562032; Par Survivor Space: 62872496 - 0 INFO [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms. CMS Old Gen: 3579836688 - 3579805792; Par Eden Space: 503316480 - 332391096; Par Survivor Space: 62914544 - 0 INFO [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms. CMS Old Gen: 3579835784 - 3579829760; Par Eden Space: 503316480 - 351991456; Par Survivor Space: 62914552 - 0 INFO [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms. 
CMS Old Gen: 3579838112 - 3579799752; Par Eden Space: 503316480 - 366222584; Par Survivor Space: 62914560 - 0 INFO [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms. CMS Old Gen: 3579831424 - 3579817392; Par Eden Space: 503316480 - 388702928; Par Survivor Space: 62914552 - 0 INFO [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms. CMS Old Gen: 3579817392 - 3579838424; Par Eden Space: 503316480 - 408992784; Par Survivor Space: 62896720 - 0 INFO [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms. CMS Old Gen: 3579838424 - 3579816424; Par Eden Space: 503316480 - 438633608; Par Survivor Space: 62914544 - 0 INFO [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms. CMS Old Gen: 3579816424 - 3579785496; Par Eden Space: 503316480 - 441354856; Par Survivor Space: 62889528 - 0 INFO [Service
Re: How to model data to achieve specific data locality
I think he mentioned 100MB as the max size - planning for 1MB might make your data model difficult to work with. On Sun Dec 07 2014 at 12:07:47 PM Kai Wang dep...@gmail.com wrote: Thanks for the help. I wasn't clear on how clustering columns work. Coming from Thrift experience, it took me a while to understand how a clustering column impacts partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with some bucket assumption. If the partition size exceeds the threshold, I may need to re-bucket using a smaller bucket size. On another thread Eric mentions the optimal partition size should be 100 KB ~ 1 MB. I will use that as the starting point to design my bucket strategy. On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky j...@basetechnology.com wrote: It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect that the term "sequence" is being overloaded in some subtly misleading way here. Besides, we've already answered the headline question - data locality is achieved by having a common partition key. So, we need some clarity as to what question we are really focusing on. And, of course, we should be asking the "Cassandra Data Modeling 101" question of what your queries should look like - how exactly do you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored. My immediate question to get things back on track: when you say "The typical read is to load a subset of sequences with the same seq_id", what type of "subset" are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no "subset" concept, nor a "load subset" command, so what are we really talking about?
Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented. -- Jack Krupansky *From:* Eric Stevens migh...@gmail.com *Sent:* Sunday, December 7, 2014 10:12 AM *To:* user@cassandra.apache.org *Subject:* Re: How to model data to achieve specific data locality Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq type. From a data model perspective, these are just new values in a row. If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq types. Or defining a schema which includes the set of all possible columns (that's when you're getting into ALTERs when a new column comes or goes). All sequences with the same seq_id tend to grow at the same rate. Note that it is an anti pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate sub partitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume). I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics. If you wanted to work for this, you can certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data. 
With Cassandra's default partitioner of murmur3, that's probably pretty challenging - murmur3 isn't designed to be cryptographically strong (it makes no attempt to make forcing a collision difficult), but it is meant to have good distribution (so it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need a good ring-balancing strategy to distribute your data evenly over the ring. On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan doanduy...@gmail.com wrote: Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly -- Then use bucketing to avoid too-wide partitions. Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and
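Eric's advice to rotate sub-partitions by a non-repeating time bucket can be sketched as follows: the bucket becomes part of the partition key, so new writes for a seq_id land in a fresh partition each period instead of appending to one row forever, keeping each partition's data in just a few SSTables. Day granularity here is an arbitrary choice for illustration; pick hour/day/week based on write volume.

```python
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> str:
    """Day-granularity time bucket; a partition key of (seq_id, bucket)
    stops any single partition from growing indefinitely."""
    return ts.strftime('%Y-%m-%d')

ts = datetime(2014, 12, 7, 10, 12, tzinfo=timezone.utc)
print(day_bucket(ts))  # 2014-12-07
```

Reads for a seq_id then enumerate the buckets covering the time range of interest, which is cheap because buckets are computable, not stored.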
Re: Could ring cache really improve performance in Cassandra?
I would really not recommend using Thrift for anything at this point, including your load tests. Take a look at CQL; all development is going there, and in 2.1 it has seen a massive performance boost over 2.0. You may want to try the Cassandra stress tool included in 2.1; it can stress a table you've already built. That way you can rule out any bugs on the client side. If you're going to keep using your tool, however, it would be helpful if you sent out a link to the repo, since currently we have no way of knowing if you've got a client-side bug (data model or code) that's limiting your performance. On Sun Dec 07 2014 at 7:55:16 PM 孔嘉林 kongjiali...@gmail.com wrote: I find under the src/client folder of the Cassandra 2.1.0 source code there is a *RingCache.java* file. It uses a Thrift client calling the *describe_ring()* API to get the token range of each Cassandra node. It is used on the client side. The client can use it, combined with the partitioner, to get the target node. In this way there is no need to route requests between Cassandra nodes, and the client can directly connect to the target node. So maybe it can save some routing time and improve performance. Thank you very much. 2014-12-08 1:28 GMT+08:00 Jonathan Haddad j...@jonhaddad.com: What's a ring cache? FYI, if you're using the DataStax CQL drivers, they will automatically route requests to the correct node. On Sun Dec 07 2014 at 12:59:36 AM kong kongjiali...@gmail.com wrote: Hi, I'm doing a stress test on Cassandra. And I learned that using a ring cache can improve performance because the client requests can directly go to the target Cassandra server and the coordinator Cassandra node is the desired target node. In this way, there is no need for the coordinator node to route the client requests to the target node, and maybe we can get a linear performance increment. However, in my stress test on an Amazon EC2 cluster, the test results are weird. It seems that there's no performance improvement after using the ring cache.
Could anyone help me explain this results? (Also, I think the results of test without ring cache is weird, because there's no linear increment on QPS when new nodes are added. I need help on explaining this, too). The results are as follows: INSERT(write): Node count Replication factor QPS(No ring cache) QPS(ring cache) 1 1 18687 20195 2 1 20793 26403 2 2 22498 21263 4 1 28348 30010 4 3 28631 24413 SELECT(read): Node count Replication factor QPS(No ring cache) QPS(ring cache) 1 1 24498 22802 2 1 28219 27030 2 2 35383 36674 4 1 34648 28347 4 3 52932 52590 Thank you very much, Joy
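What RingCache (and the token-aware routing built into the DataStax drivers) buys you is essentially this lookup, done client-side: given the ring's token/owner pairs from describe_ring(), find the node owning a key's token and connect to it directly, skipping the coordinator hop. A toy sketch with made-up tokens (real tokens come from the partitioner, e.g. Murmur3, and real rings have vnodes and replicas):

```python
import bisect

# Hypothetical ring: (token, node) pairs as describe_ring() might report.
RING = [(-6000, 'node1'), (-1000, 'node2'), (3000, 'node3'), (8000, 'node4')]
TOKENS = [t for t, _ in RING]

def owner(key_token: int) -> str:
    """First node whose token is >= the key's token, wrapping at the end -
    the coordinator hop that client-side routing avoids."""
    i = bisect.bisect_left(TOKENS, key_token)
    return RING[i % len(RING)][1]

print(owner(2500))  # node3
print(owner(9000))  # node1 (wraps around)
```

Note this saves one network hop at most, which is why the gains in the benchmark above are modest; it does not change how much work each replica does.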
Re: Can not connect with cqlsh to something different than localhost
Listen address needs the actual address, not the interface. This is best accomplished by setting up proper hostnames for each machine (through DNS or hosts file) and leaving listen_address blank, as it will pick the external ip. Otherwise, you'll need to set the listen address to the IP of the machine you want on each machine. I find the former to be less of a pain to manage. On Mon Dec 08 2014 at 2:49:55 AM Richard Snowden richard.t.snow...@gmail.com wrote: This did not work either. I changed /etc/cassandra.yaml and restarted Cassandra (I even restarted the machine to make 100% sure). What I tried: 1) listen_address: localhost - connection OK (but of course I can't connect from outside the VM to localhost) 2) Set listen_interface: eth0 - connection refused 3) Set listen_address: 192.168.111.136 - connection refused What to do? Try: $ netstat -lnt and see which interface port 9042 is listening on. You will likely need to update cassandra.yaml to change the interface. By default, Cassandra is listening on localhost so your local cqlsh session works. On Sun, 7 Dec 2014 23:44 Richard Snowden richard.t.snow...@gmail.com wrote: I am running Cassandra 2.1.2 in an Ubuntu VM. cqlsh or cqlsh localhost works fine. But I can not connect from outside the VM (firewall, etc. disabled). Even when I do cqlsh 192.168.111.136 in my VM I get connection refused. 
This is strange because when I check my network config I can see that 192.168.111.136 is my IP:

root@ubuntu:~# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0c:29:02:e0:de
          inet addr:192.168.111.136  Bcast:192.168.111.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe02:e0de/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:16042 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8638 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21307125 (21.3 MB)  TX bytes:709471 (709.4 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:550 errors:0 dropped:0 overruns:0 frame:0
          TX packets:550 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:148053 (148.0 KB)  TX bytes:148053 (148.0 KB)

root@ubuntu:~# cqlsh 192.168.111.136 9042
Connection error: ('Unable to connect to any servers', {'192.168.111.136': error(111, "Tried connecting to [('192.168.111.136', 9042)]. Last error: Connection refused")})

What to do?
Re: Could ring cache really improve performance in Cassandra?
I agree with Robert. If you're trying to test Cassandra, test Cassandra using stress. Set a reasonable benchmark, and then you'll be able to aim for that with your client code. Otherwise you're likely to be asking a lot of the wrong questions and making incorrect assumptions. On Mon Dec 08 2014 at 12:42:32 AM Robert Stupp sn...@snazy.de wrote: cassandra-stress is a great tool to check whether the sizing of your cluster, in combination with your data model, will fit your production needs - i.e. without the application :) Removing the application removes any possible bugs from the load test. Sure, it's a necessary step to do it with your application - but I'd recommend starting with the stress test tool first. Thrift is a deprecated API. I strongly recommend using the C++ driver (I'm pretty sure it supports the native protocol). The native protocol achieves approx. twice the performance of thrift via much fewer TCP connections. (Thrift is RPC - connections usually waste system, application and server resources while waiting for something. The native protocol is a multiplexed protocol.) As John already said, all development effort is spent on CQL3 and the native protocol - thrift is just supported. With CQL you can do everything that you can do with thrift, plus more new stuff. I also recommend using prepared statements (it automagically works in a distributed cluster with the native protocol) - it eliminates the effort of parsing CQL statements again and again. Am 08.12.2014 um 09:26 schrieb 孔嘉林 kongjiali...@gmail.com: Thanks Jonathan, actually I'm wondering how CQL is implemented underneath - a different RPC mechanism? Why is it faster than thrift? I know I'm wrong, but now I just regard CQL as a query language. Could you please help explain this to me? I still feel puzzled after reading some docs about CQL. I create tables in CQL, and use the cql3 API in thrift. I don't know what else I can do with CQL. And I am using C++ to write the client side code.
Currently I am not using the C++ driver and want to write some simple functionality by myself. Also, I didn't use the stress test tool provided in the Cassandra distribution because I also want to make sure whether I can achieve good performance as expected using my client code. I know others have benchmarked Cassandra and got good results. But if I cannot reproduce the satisfactory results, I cannot use it in my case. I will create a repo and send a link later, hope to get your kind help. Thanks very much. 2014-12-08 14:28 GMT+08:00 Jonathan Haddad j...@jonhaddad.com: I would really not recommend using thrift for anything at this point, including your load tests. Take a look at CQL; all development is going there, and 2.1 has seen a massive performance boost over 2.0. You may want to try the Cassandra stress tool included in 2.1; it can stress a table you've already built. That way you can rule out any bugs on the client side. If you're going to keep using your tool, however, it would be helpful if you sent out a link to the repo, since currently we have no way of knowing if you've got a client side bug (data model or code) that's limiting your performance. On Sun Dec 07 2014 at 7:55:16 PM 孔嘉林 kongjiali...@gmail.com wrote: I find that under the src/client folder of the Cassandra 2.1.0 source code, there is a *RingCache.java* file. It uses a thrift client calling the *describe_ring()* API to get the token range of each Cassandra node. It is used on the client side. The client can use it combined with the partitioner to get the target node. In this way there is no need to route requests between Cassandra nodes, and the client can directly connect to the target node. So maybe it can save some routing time and improve performance. Thank you very much. 2014-12-08 1:28 GMT+08:00 Jonathan Haddad j...@jonhaddad.com: What's a ring cache? FYI if you're using the DataStax CQL drivers they will automatically route requests to the correct node.
On Sun Dec 07 2014 at 12:59:36 AM kong kongjiali...@gmail.com wrote: Hi, I'm doing a stress test on Cassandra. And I have learned that using a ring cache can improve performance because the client requests can go directly to the target Cassandra server, so the coordinator node is the desired target node. In this way, there is no need for a coordinator node to route the client requests to the target node, and maybe we can get a linear performance increase. However, in my stress test on an Amazon EC2 cluster, the test results are weird. It seems that there's no performance improvement after using the ring cache. Could anyone help me explain these results? (Also, I think the results of the test without the ring cache are weird, because there's no linear increase in QPS when new nodes are added. I need help explaining this, too). The results are as follows:

INSERT (write):
Node count  Replication factor  QPS (no ring cache)  QPS (ring cache)
1           1                   18687                20195
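The ring-cache idea in this thread amounts to keeping a client-side token-to-node map (what describe_ring() plus the partitioner gives you) and hashing the partition key locally. A minimal illustrative sketch - the class, node addresses, and hash below are stand-ins, not the Thrift API or a real Cassandra partitioner:

```python
import bisect
import hashlib

class RingCache:
    """Client-side map of tokens to nodes; each node owns (previous_token, token]."""

    def __init__(self, token_to_node):
        self.tokens = sorted(token_to_node)
        self.nodes = [token_to_node[t] for t in self.tokens]

    def token_for(self, partition_key: bytes) -> int:
        # Stand-in for Murmur3/RandomPartitioner: any stable hash works for the sketch
        return int(hashlib.md5(partition_key).hexdigest(), 16) % 2**32

    def node_for(self, partition_key: bytes) -> str:
        t = self.token_for(partition_key)
        # First token >= t owns the key; wrap around at the top of the ring
        i = bisect.bisect_left(self.tokens, t) % len(self.tokens)
        return self.nodes[i]

ring = RingCache({2**30: "10.0.0.1", 2**31: "10.0.0.2",
                  3 * 2**30: "10.0.0.3", 2**32 - 1: "10.0.0.4"})
# The client can now connect straight to the owning node, skipping the
# coordinator hop - which is exactly what token-aware drivers automate.
print(ring.node_for(b"user:42"))
```

This is also what the DataStax drivers' token-aware routing does for you, which is why Jonathan asks "what's a ring cache?" - with a token-aware driver there is nothing left to hand-roll.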
Re: Cassandra Files Taking up Much More Space than CF
You don't need a prime number of nodes in your ring, but it's not a bad idea to have it be a multiple of your RF when your cluster is small. On Tue Dec 09 2014 at 8:29:35 AM Nate Yoder n...@whistle.com wrote: Hi Ian, Thanks for the suggestion but I had actually already done that prior to the scenario I described (to get myself some free space) and when I ran nodetool cfstats it listed 0 snapshots as expected, so unfortunately I don't think that is where my space went. One additional piece of information I forgot to point out is that when I ran nodetool status on the node it included all 6 nodes. I have also heard it mentioned that I may want to have a prime number of nodes which may help protect against split-brain. Is this true? If so, does it still apply when I am using vnodes? Thanks again, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 7:42 AM, Ian Rose ianr...@fullstory.com wrote: Try `nodetool clearsnapshot`, which will delete any snapshots you have. I have never taken a snapshot with nodetool, yet I found several snapshots on my disk recently (which can take a lot of space). So perhaps they are automatically generated by some operation? No idea. Regardless, nuking those freed up a ton of space for me. - Ian On Mon, Dec 8, 2014 at 8:12 PM, Nate Yoder n...@whistle.com wrote: Hi All, I am new to Cassandra so I apologise in advance if I have missed anything obvious but this one currently has me stumped. I am currently running a 6 node Cassandra 2.1.1 cluster on EC2 using C3.2XLarge nodes which overall is working very well for us. However, after letting it run for a while I seem to get into a situation where the amount of disk space used far exceeds the total amount of data on each node and I haven't been able to get the size to go back down except by stopping and restarting the node. For example, I have almost all of my data in one table.
On one of my nodes right now the total space used (as reported by nodetool cfstats) is 57.2 GB and there are no snapshots. However, when I look at the size of the data files (using du) the data file for that table is 107GB. Because the C3.2XLarge only have 160 GB of SSD you can see why this quickly becomes a problem. Running nodetool compact didn't reduce the size and neither does running nodetool repair -pr on the node. I also tried nodetool flush and nodetool cleanup (even though I have not added or removed any nodes recently) but it didn't change anything either. In order to keep my cluster up I then stopped and started that node and the size of the data file dropped to 54GB while the total column family size (as reported by nodetool) stayed about the same. Any suggestions as to what I could be doing wrong? Thanks, Nate
Re: Cassandra Files Taking up Much More Space than CF
Well, I personally don't like RF=2. It means if you're using CL=QUORUM and a node goes down, you're going to have a bad time (downtime). If you're using CL=ONE then you'd be ok. However, I am not wild about losing a node and having only 1 copy of my data available in prod. On Tue Dec 09 2014 at 8:40:37 AM Nate Yoder n...@whistle.com wrote: Thanks Jonathan. So there is nothing too idiotic about my current set-up with 6 boxes, each with 256 vnodes, and a RF of 2? I appreciate the help, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 8:31 AM, Jonathan Haddad j...@jonhaddad.com wrote: You don't need a prime number of nodes in your ring, but it's not a bad idea to have it be a multiple of your RF when your cluster is small. On Tue Dec 09 2014 at 8:29:35 AM Nate Yoder n...@whistle.com wrote: Hi Ian, Thanks for the suggestion but I had actually already done that prior to the scenario I described (to get myself some free space) and when I ran nodetool cfstats it listed 0 snapshots as expected, so unfortunately I don't think that is where my space went. One additional piece of information I forgot to point out is that when I ran nodetool status on the node it included all 6 nodes. I have also heard it mentioned that I may want to have a prime number of nodes which may help protect against split-brain. Is this true? If so, does it still apply when I am using vnodes? Thanks again, Nate -- *Nathanael Yoder* Principal Engineer Data Scientist, Whistle 415-944-7344 // n...@whistle.com On Tue, Dec 9, 2014 at 7:42 AM, Ian Rose ianr...@fullstory.com wrote: Try `nodetool clearsnapshot`, which will delete any snapshots you have. I have never taken a snapshot with nodetool, yet I found several snapshots on my disk recently (which can take a lot of space). So perhaps they are automatically generated by some operation? No idea. Regardless, nuking those freed up a ton of space for me.
- Ian On Mon, Dec 8, 2014 at 8:12 PM, Nate Yoder n...@whistle.com wrote: Hi All, I am new to Cassandra so I apologise in advance if I have missed anything obvious but this one currently has me stumped. I am currently running a 6 node Cassandra 2.1.1 cluster on EC2 using C3.2XLarge nodes which overall is working very well for us. However, after letting it run for a while I seem to get into a situation where the amount of disk space used far exceeds the total amount of data on each node and I haven't been able to get the size to go back down except by stopping and restarting the node. For example, in my data I have almost all of my data in one table. On one of my nodes right now the total space used (as reported by nodetool cfstats) is 57.2 GB and there are no snapshots. However, when I look at the size of the data files (using du) the data file for that table is 107GB. Because the C3.2XLarge only have 160 GB of SSD you can see why this quickly becomes a problem. Running nodetool compact didn't reduce the size and neither does running nodetool repair -pr on the node. I also tried nodetool flush and nodetool cleanup (even though I have not added or removed any nodes recently) but it didn't change anything either. In order to keep my cluster up I then stopped and started that node and the size of the data file dropped to 54GB while the total column family size (as reported by nodetool) stayed about the same. Any suggestions as to what I could be doing wrong? Thanks, Nate
Re: upgrade cassandra from 2.0.6 to 2.1.2
Yes. It is, in general, a best practice to upgrade to the latest bug fix release before doing an upgrade to the next point release. On Tue Dec 09 2014 at 6:58:24 PM wyang wy...@v5.cn wrote: I looked at some upgrade documentation and am a little puzzled. According to https://github.com/apache/cassandra/blob/cassandra-2.1/NEWS.txt, “Rolling upgrades from anything pre-2.0.7 is not supported”. Does it mean we should upgrade to 2.0.7 or later first? Can we do a rolling upgrade to 2.0.7? Do we need to upgrade sstables after that? There seems to be nothing specific to note about upgrading between 2.0.6 and 2.0.7 in NEWS.txt. Any advice will be kindly appreciated.
Re: Cassandra Maintenance Best practices
I did a presentation on diagnosing performance problems in production at the US and Euro summits, in which I covered quite a few tools and preventative measures you should know about when running a production cluster. You may find it useful: http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/ On OpsCenter - I recommend it. It gives you a nice dashboard. I don't think it's completely comprehensive (but no tool really is) but it gets you 90% of the way there. It's a good idea to run repairs, especially if you're doing deletes or querying at CL=ONE. I assume you're not using quorum, because on RF=2 that's the same as CL=ALL. I recommend at least RF=3 because if you lose 1 server, you're on the edge of data loss. On Tue Dec 09 2014 at 7:19:32 PM Neha Trivedi nehajtriv...@gmail.com wrote: Hi, We have a two-node cluster configuration in production with RF=2, which means that the data is written to both nodes. It has been running for about a month now and has a good amount of data. Questions: 1. What are the best practices for maintenance? 2. Is OpsCenter required to be installed, or can I manage with the nodetool utility? 3. Is it necessary to run repair weekly? thanks regards Neha
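The quorum arithmetic behind that advice is simple: QUORUM is floor(RF/2) + 1 replicas, so at RF=2 a quorum is both replicas - the same as CL=ALL - and a single node down means quorum operations fail. A quick sketch of the math:

```python
# Replicas required for a QUORUM read/write at a given replication factor.
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in (1, 2, 3, 5):
    q = quorum(rf)
    # rf - q is how many replicas can be down with QUORUM still succeeding
    print(f"RF={rf}: QUORUM={q}, tolerates {rf - q} replica(s) down")
```

At RF=3 a quorum is 2, so one node can be lost without downtime, which is one reason RF=3 is the common recommendation.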
Re: batch_size_warn_threshold_in_kb
The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network and server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote: Ryan, Thanks for the quick response. I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold. In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch? Mohammed *From:* Ryan Svihla [mailto:rsvi...@datastax.com] *Sent:* Thursday, December 11, 2014 12:56 PM *To:* user@cassandra.apache.org *Subject:* Re: batch_size_warn_threshold_in_kb Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here https://issues.apache.org/jira/browse/CASSANDRA-6487 Key reasoning for the desire comes from Patrick McFadden: Yes that was in bytes. Just in my own experience, I don't recommend more than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 50 byte mutations. Totally up for debate.
It's totally changeable; however, it's there in no small part because so many people mistake the BATCH keyword for a performance optimization, and the warning helps flag those cases of misuse. On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller moham...@glassbeam.com wrote: Hi – The cassandra.yaml file has a property called *batch_size_warn_threshold_in_kb*. The default size is 5kb and according to the comments in the yaml file, it is used to log a WARN on any batch size exceeding this value in kilobytes. It says caution should be taken on increasing the size of this threshold as it can lead to node instability. Does anybody know the significance of this magic number 5kb? Why would a higher number (say 10kb) lead to node instability? Mohammed -- http://www.datastax.com/ Ryan Svihla Solution Architect https://twitter.com/foundev http://www.linkedin.com/pub/ryan-svihla/12/621/727/ DataStax is the fastest, most scalable distributed database technology, delivering Apache Cassandra to the world’s most innovative enterprises. DataStax is built to be agile, always-on, and predictably scalable to any size. With more than 500 customers in 45 countries, DataStax is the database technology and transactional backbone of choice for the world's most innovative companies such as Netflix, Adobe, Intuit, and eBay.
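The numbers in this thread line up with simple arithmetic: 5 KB is roughly Patrick's 100 mutations at ~50 bytes each, while Mohammed's 1000-byte mutations trip the warning after only 5. A sketch of that back-of-the-envelope math:

```python
# batch_size_warn_threshold_in_kb defaults to 5, i.e. 5 * 1024 bytes.
WARN_THRESHOLD_BYTES = 5 * 1024

def mutations_before_warn(bytes_per_mutation: int) -> int:
    """How many mutations of a given size fit under the default warning threshold."""
    return WARN_THRESHOLD_BYTES // bytes_per_mutation

print(mutations_before_warn(50))    # 102 - close to the ~100-mutation rule of thumb
print(mutations_before_warn(1000))  # 5  - Mohammed's example trips the warning fast
```

This also makes clear why a byte threshold, rather than a mutation count, was chosen: it is the payload size, not the statement count, that drives coordinator memory and GC pressure.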
Re: `nodetool cfhistogram` utility script
Hey Jens, Unfortunately the output of the nodetool histograms changes between versions. While I think your script is useful, it's likely to break between versions. You might be interested to weigh in on the JIRA ticket to make the nodetool output machine friendly: https://issues.apache.org/jira/browse/CASSANDRA-5977 On Fri Dec 12 2014 at 5:48:51 AM Jens Rantil jens.ran...@tink.se wrote: Hi, I just quickly put together a tiny utility script to estimate average/mean/min/max/percentiles for `nodetool cfhistogram` latency output. Maybe it could be useful to someone else, I don't know. You can find it here: https://gist.github.com/JensRantil/3da67e39f50aaf4f5bce Future improvements would obviously be to not hardcode `us:` and to support the other histograms. Also, this logic should maybe even be moved into `nodetool cfhistogram` itself since these are fairly common metrics for latency. Cheers, Jens ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
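Such a script boils down to count-weighted statistics over the histogram buckets. A rough, hypothetical sketch of that arithmetic - the bucket offsets below are invented, and real cfhistograms output (and its format) varies by version, which is exactly the fragility discussed above:

```python
def weighted_percentile(buckets, pct):
    """buckets: list of (latency_us, count) rows; pct in [0, 100].
    Returns the bucket offset at which the cumulative count crosses pct."""
    total = sum(c for _, c in buckets)
    threshold = total * pct / 100.0
    seen = 0
    for latency, count in sorted(buckets):
        seen += count
        if seen >= threshold:
            return latency
    return sorted(buckets)[-1][0]

# Made-up (latency_us, count) rows standing in for parsed cfhistograms output
buckets = [(35, 10), (42, 40), (50, 30), (60, 15), (72, 5)]
mean = sum(l * c for l, c in buckets) / sum(c for _, c in buckets)
print(weighted_percentile(buckets, 50))  # 42 (p50)
print(weighted_percentile(buckets, 99))  # 72 (p99)
print(mean)                              # 47.9
```

Note the caveat that applies to any such estimate: percentiles land on bucket boundaries, so the resolution is only as fine as the histogram's bucketing.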
Re: batch_size_warn_threshold_in_kb
There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition) they can reduce network overhead because they're effectively a single mutation in the eyes of the cluster. However, if you're not doing that (and most people aren't!) you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers, and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and 2) waiting on a response from a significant portion of them before it can report success or failure. Any delay, from GC to a bad disk, can affect the performance of the entire batch. On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network and server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1. It saves network round-trips between the client and the server (and sometimes between the coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in the Using and misusing batches section.
For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests. At a minimum the CQL spec should make a clearer statement of intent and non-intent for BATCH. -- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network and server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures).
tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote: Ryan, Thanks for the quick response. I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each mutation is 1000+ bytes, then with just 5 mutations, you will hit that threshold. In addition, Patrick is saying that he does not recommend more than 100 mutations per batch. So why not warn users just on the # of mutations in a batch? Mohammed *From:* Ryan Svihla [mailto:rsvi...@datastax.com] *Sent:* Thursday, December 11, 2014 12:56 PM *To:* user@cassandra.apache.org *Subject:* Re: batch_size_warn_threshold_in_kb Nothing magic, just put in there based on experience. You can find the story behind the original recommendation here https://issues.apache.org/jira/browse/CASSANDRA-6487 Key reasoning for the desire comes from Patrick McFadden: Yes that was in bytes. Just in my own experience, I don't recommend more than ~100
Re: batch_size_warn_threshold_in_kb
To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because it always gets to write locally and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side. To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster) which massively helps performance. It provides the benefit of batches but without the coordinator overhead. Can you post your benchmark code? On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad j...@jonhaddad.com wrote: There are cases where it can. For instance, if you batch multiple mutations to the same partition (and talk to a replica for that partition) they can reduce network overhead because they're effectively a single mutation in the eyes of the cluster. However, if you're not doing that (and most people aren't!) you end up putting additional pressure on the coordinator because now it has to talk to several other servers. If you have 100 servers, and perform a mutation on 100 partitions, you could have a coordinator that's 1) talking to every machine in the cluster and 2) waiting on a response from a significant portion of them before it can report success or failure. Any delay, from GC to a bad disk, can affect the performance of the entire batch. On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky j...@basetechnology.com wrote: Jonathan and Ryan, Jonathan says “It is absolutely not going to help you if you're trying to lump queries together to reduce network and server overhead - in fact it'll do the opposite”, but I would note that the CQL3 spec says “The BATCH statement ... serves several purposes: 1.
It saves network round-trips between the client and the server (and sometimes between the coordinator and the replicas) when batching multiple updates.” Is the spec inaccurate? I mean, it seems in conflict with your statement. See: https://cassandra.apache.org/doc/cql3/CQL.html I see the spec as gospel – if it’s not accurate, let’s propose a change to make it accurate. The DataStax CQL doc is more nuanced: “Batching multiple statements can save network exchanges between the client/server and coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in the Using and misusing batches section. For information about the fastest way to load data, see Cassandra: Batch loading without the Batch keyword.” Maybe what we really need is a “client/driver-side batch”, which is simply a way to collect “batches” of operations in the client/driver and then let the driver determine what degree of batching and asynchronous operation is appropriate. It might also be nice to have an inquiry for the cluster as to what batch size is most optimal for the cluster, like number of mutations in a batch and number of simultaneous connections, and to have that be dynamic based on overall cluster load. I would also note that the example in the spec has multiple inserts with different partition key values, which flies in the face of the admonition to refrain from using server-side distribution of requests. At a minimum the CQL spec should make a clearer statement of intent and non-intent for BATCH.
-- Jack Krupansky *From:* Jonathan Haddad j...@jonhaddad.com *Sent:* Friday, December 12, 2014 12:58 PM *To:* user@cassandra.apache.org ; Ryan Svihla rsvi...@datastax.com *Subject:* Re: batch_size_warn_threshold_in_kb The really important thing to take away from Ryan's original post is that batches are not there for performance. The only case I consider batches to be useful for is when you absolutely need to know that several tables all get a mutation (via logged batches). The use case for this is when you've got multiple tables that are serving as different views for data. It is absolutely not going to help you if you're trying to lump queries together to reduce network and server overhead - in fact it'll do the opposite. If you're trying to do that, instead perform many async queries. The overhead of batches in cassandra is significant and you're going to hit a lot of problems if you use them excessively (timeouts / failures). tl;dr: you probably don't want batch, you most likely want many async calls On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller moham...@glassbeam.com wrote: Ryan, Thanks for the quick response. I did see that jira before posting my question on this list. However, I didn’t see any information about why 5kb+ data will cause instability. 5kb or even 50kb seems too small. For example, if each
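Jonathan's fan-out argument is easy to sanity-check with a toy model: count how many distinct nodes one coordinator must contact for a single batch of N mutations. A hedged sketch, assuming uniformly random partitions, contiguous RF placement, and no vnodes - illustrative numbers, not a Cassandra benchmark:

```python
import random

def fanout(num_nodes: int, num_mutations: int, rf: int = 3, seed: int = 1) -> int:
    """Distinct nodes a single coordinator touches for one batch."""
    rng = random.Random(seed)
    contacted = set()
    for _ in range(num_mutations):
        primary = rng.randrange(num_nodes)
        # Replicas modeled as the primary plus the next rf-1 nodes on the ring
        contacted.update((primary + k) % num_nodes for k in range(rf))
    return len(contacted)

print(fanout(3, 100))    # 3 - at RF=N=3 the batch only ever touches 3 nodes
print(fanout(100, 100))  # on 100 nodes the coordinator talks to most of the cluster
```

This is the RF=N=3 special case in a nutshell: the small cluster makes batching look free, and the cost only appears once the cluster outgrows the replication factor.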
Re: batch_size_warn_threshold_in_kb
One thing to keep in mind is the overhead of a batch goes up as the number of servers increases. Talking to 3 is going to have a much different performance profile than talking to 20. Keep in mind that the coordinator is going to be talking to every server in the cluster with a big batch. The amount of local writes will decrease as it owns a smaller portion of the ring. All you've done is add an extra network hop between your client and where the data should actually be. You also start to have an impact on GC in a very negative way. Your point is valid about topology changes, but that's a relatively rare occurrence, and the driver is notified pretty quickly, so I wouldn't optimize for that case. Can you post your test code in a gist or something? I can't really talk about your benchmark without seeing it and you're basing your stance on the premise that it is correct, which it may not be. On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens migh...@gmail.com wrote: You can see what the partition key strategies are for each of the tables, test5 shows the least improvement. The set (aid, end) should be unique, and bckt is derived from end. Some of these layouts result in clustering on the same partition keys, that's actually tunable with the ~15 per bucket reported (exact number of entries per bucket will vary but should have a mean of 15 in that run - it's an input parameter to my tests). test5 obviously ends up being exclusively unique partitions for each record. Your points about: 1) Failed batches having a higher cost than failed single statements 2) In my test, every node was a replica for all data. These are both very good points.
For #1, since the worst case scenario is nearly twice as fast in batches as its single statement equivalent, in terms of impact on the client, you'd have to be retrying half your batches before you broke even there (but of course those retries are not free to the cluster, so you probably make the performance tipping point approach a lot faster). This alone may be cause to justify avoiding batches, or at least severely limiting their size (hey, that's what this discussion is about!). For #2, that's certainly a good point; for this test cluster, I should at least re-run with RF=1 so that proxying times start to matter. If you're not using a token aware client or not using a token aware policy for whatever reason, this should even out though, no? Each node will end up coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are batched or single statements. The DS driver is very careful to caution that the topology map it maintains makes no guarantees on freshness, so you may see a significant performance penalty in your client when the topology changes if you're depending on token aware routing as part of your performance requirements. I'm curious what your thoughts are on grouping statements by primary replica according to the routing policy, and executing unlogged batches that way (so that for token aware routing, all statements are executed on a replica; for others it'd make no difference). Retries are still more expensive, but you still get the proxy avoidance of token-aware routing.
It's pretty easy to do in Scala:

def groupByFirstReplica(statements: Iterable[Statement])(implicit session: Session): Map[Host, Seq[Statement]] = {
  val meta = session.getCluster.getMetadata
  statements.groupBy { st => meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next }
}

val result = Future.traverse(groupByFirstReplica(statements).values).map(st => newBatch(st).executeAsync())

Let me get together my test code, it depends on some existing utilities we use elsewhere, such as implicit conversions between Google and Scala native futures. I'll try to put this together in a format that's runnable for you in a Scala REPL console without having to resolve our internal dependencies. This may not be today though. Also, @Ryan, I don't think that shuffling would make a difference for my above tests since as Jon observed, all my nodes were already replicas there. On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla rsvi...@datastax.com wrote: Also..what happens when you turn on shuffle with token aware? http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad j...@jonhaddad.com wrote: To add to Ryan's (extremely valid!) point, your test works because the coordinator is always a replica. Try again using 20 (or 50) nodes. Batching works great at RF=N=3 because it always gets to write to local and talk to exactly 2 other servers on every request. Consider what happens when the coordinator needs to talk to 100 servers. It's unnecessary overhead on the server side. To save network overhead, Cassandra 2.1 added support for response grouping (see http://www.datastax.com/dev/blog/cassandra-2-1-now
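For readers who don't speak Scala, the same grouping idea can be sketched in Python. Here replicas_for() is a hypothetical stand-in for the driver's cluster metadata lookup (Metadata.getReplicas in the Java driver), and statements are simplified to (partition_key, cql) pairs:

```python
from collections import defaultdict

def group_by_first_replica(statements, replicas_for):
    """Bucket statements by their first replica so each unlogged batch only
    carries mutations its coordinator actually owns."""
    groups = defaultdict(list)
    for stmt in statements:
        groups[replicas_for(stmt)[0]].append(stmt)
    return dict(groups)

# Toy replica map standing in for real topology metadata
ring = {"a": ["n1", "n2"], "b": ["n2", "n3"], "c": ["n1", "n3"]}
stmts = [("a", "INSERT 1"), ("b", "INSERT 2"), ("c", "INSERT 3"), ("a", "INSERT 4")]

groups = group_by_first_replica(stmts, lambda s: ring[s[0]])
print({host: len(g) for host, g in groups.items()})  # {'n1': 3, 'n2': 1}
```

Each group would then be sent as one unlogged batch to its owning host, so a token-aware client never pays the proxy hop; as Eric notes, the trade-off is that a failed batch still retries more work than a failed single statement.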
Re: batch_size_warn_threshold_in_kb
On Sat Dec 13 2014 at 10:00:16 AM Eric Stevens migh...@gmail.com wrote: Isn't the net effect of coordination overhead incurred by batches basically the same as the overhead incurred by RoundRobin or other non-token-aware request routing? As the cluster size increases, each node would coordinate the same percentage of writes in batches under token awareness as they would under a more naive single statement routing strategy. If write volume per time unit is the same in both approaches, each node ends up coordinating the majority of writes under either strategy as the cluster grows. If you're not token aware, there's extra coordinator overhead, yes. If you are token aware, not the case. I'm operating under the assumption that you'd want to be token aware, since I don't see a point in not doing so :) Unfortunately my Scala isn't the best so I'm going to have to take a little bit to wade through the code. It may be useful to run cassandra-stress (it doesn't seem to have a mode for batches) to get a baseline on non-batches. I'm curious to know if you get different numbers than the scala profiler. GC pressure in the cluster is a concern of course, as you observe. But delta performance is *substantial* from what I can see. As in the case where you're bumping up against retries, this will cause you to fall over much more rapidly as you approach your tipping point, but in a healthy cluster, it's the same write volume, just a longer tenancy in eden. If reasonably sized batches are causing survivors, you're not far off from falling over anyway. On Sat, Dec 13, 2014 at 10:04 AM, Jonathan Haddad j...@jonhaddad.com wrote: One thing to keep in mind is the overhead of a batch goes up as the number of servers increases. Talking to 3 is going to have a much different performance profile than talking to 20. Keep in mind that the coordinator is going to be talking to every server in the cluster with a big batch.
The amount of local writes will decrease as it owns a smaller portion of the ring. All you've done is add an extra network hop between your client and where the data should actually be. You also start to have an impact on GC in a very negative way. Your point is valid about topology changes, but that's a relatively rare occurrence, and the driver is notified pretty quickly, so I wouldn't optimize for that case. Can you post your test code in a gist or something? I can't really talk about your benchmark without seeing it, and you're basing your stance on the premise that it is correct, which it may not be. On Sat Dec 13 2014 at 8:45:21 AM Eric Stevens migh...@gmail.com wrote: You can see what the partition key strategies are for each of the tables; test5 shows the least improvement. The set (aid, end) should be unique, and bckt is derived from end. Some of these layouts result in clustering on the same partition keys; that's actually tunable with the ~15 per bucket reported (the exact number of entries per bucket will vary, but should have a mean of 15 in that run - it's an input parameter to my tests). test5 obviously ends up being exclusively unique partitions for each record. Your points about: 1) Failed batches having a higher cost than failed single statements 2) In my test, every node was a replica for all data. These are both very good points. For #1, since the worst case scenario is nearly twice as fast in batches as its single statement equivalent, in terms of impact on the client you'd have to be retrying half your batches before you broke even there (but of course those retries are not free to the cluster, so you probably reach the performance tipping point a lot faster). This alone may be cause to justify avoiding batches, or at least severely limiting their size (hey, that's what this discussion is about!). For #2, that's certainly a good point; for this test cluster, I should at least re-run with RF=1 so that proxying times start to matter. If you're not using a token aware client, or not using a token aware policy for whatever reason, this should even out though, no? Each node will end up coordinating 1/(nodecount-rf+1) of the mutations, regardless of whether they are batched or single statements. The DS driver is very careful to caution that the topology map it maintains makes no guarantees on freshness, so you may see a significant performance penalty in your client when the topology changes if you're depending on token aware routing as part of your performance requirements. I'm curious what your thoughts are on grouping statements by primary replica according to the routing policy, and executing unlogged batches that way (so that for token aware routing, all statements are executed on a replica; for others it'd make no difference). Retries are still more expensive, but token aware proxying avoidance is still had. It's pretty easy to do in Scala: [snip]
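Jon's fan-out point can be made concrete with a back-of-envelope calculation (the model and numbers here are mine, not from the thread):

```python
# Assumed model, not a benchmark: with n nodes and replication factor rf,
# a randomly chosen coordinator is a replica for roughly rf/n of the
# token ring, so only that share of a big batch's writes stays local;
# the rest must be proxied to other nodes.
def replica_fraction(n: int, rf: int = 3) -> float:
    """Fraction of data for which a given node is a replica (uniform-ownership model)."""
    return rf / n

for n in (3, 20, 100):
    print(f"n={n:3d}  local share ~ {replica_fraction(n):.2f}")
# At n=3, rf=3 (the test cluster in this thread) every node is a replica
# for everything; at n=100 only ~3% of a big batch is local to the
# coordinator, the rest is server-side fan-out.
```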
Re: batch_size_warn_threshold_in_kb
Not a problem - it's good to hash this stuff out and understand the technical reasons why something works or doesn't work. On Sat Dec 13 2014 at 10:07:10 AM Jonathan Haddad j...@jonhaddad.com wrote: [snip - quoted exchange duplicated above]
Re: batch_size_warn_threshold_in_kb
order= 33,686,013,000

Execution Results for 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches of 10 - Total Run Time:
traverse test3 ((aid, bckt), end, proto) reverse order = 11,030,788,000
traverse test1 ((aid, bckt), proto, end) reverse order = 13,345,962,000
traverse test2 ((aid, bckt), end) = 15,110,208,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering = 16,398,982,000
traverse test5 ((aid, bckt, end)) = 22,166,119,000

For giggles I added token aware batching (grouping statements within a single batch by meta.getReplicas(statement.getKeyspace, statement.getRoutingKey).iterator().next - see https://gist.github.com/MightyE/1c98912fca104f6138fc#file-testsuite-L176-L189 ). Here's that run; results are comparable with before, and easily inside one sigma of non-token-aware batching, so not a statistically significant difference.

Execution Results for 1 runs of 113,825 records (3 protos, 5 agents, ~15 per bucket) in batches of 10 - Total Run Time:
traverse test2 ((aid, bckt), end) = 11,429,008,000
traverse test1 ((aid, bckt), proto, end) reverse order = 12,593,034,000
traverse test4 ((aid, bckt), proto, end) no explicit ordering = 13,111,244,000
traverse test3 ((aid, bckt), end, proto) reverse order = 25,163,064,000
traverse test5 ((aid, bckt, end)) = 30,233,744,000

On Sat, Dec 13, 2014 at 11:07 AM, Jonathan Haddad j...@jonhaddad.com wrote: [snip - quoted exchange duplicated above]
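The nanosecond totals in the tables above are easier to compare as throughput. This small helper is mine, not from the thread:

```python
# Convert a run's total nanoseconds into records per second, for the
# 113,825-record runs reported above.
def throughput(total_ns: int, records: int = 113825) -> float:
    """Records per second for a run that wrote `records` rows in total_ns."""
    return records / (total_ns / 1e9)

best = throughput(11_030_788_000)   # test3, fastest layout in the first table
worst = throughput(22_166_119_000)  # test5, one unique partition per record
print(f"best ~{best:.0f} rec/s, worst ~{worst:.0f} rec/s, ratio {best / worst:.2f}")
# The spread between partition layouts (~2x in that run) dwarfs the
# difference between token-aware and non-token-aware batching.
```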
Re: bootstrapping manually when auto_bootstrap=false ?
I'd consider solving your root problem - people starting and stopping servers in prod accidentally - instead of making Cassandra more difficult to manage operationally. On Thu Dec 18 2014 at 4:04:34 AM Ryan Svihla rsvi...@datastax.com wrote: Why auto_bootstrap=false? The documentation even suggests the opposite. If you don't auto_bootstrap, the node will take queries before it has copies of all the data, and you'll get wrong answers (it'd be not unlike using CL ONE when you've got a bunch of dropped mutations on a single node in the cluster). On Wed, Dec 17, 2014 at 10:45 PM, Ben Bromhead b...@instaclustr.com wrote: - In cassandra.yaml set auto_bootstrap: false - Boot node - nodetool rebuild Very similar to http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html On 18 December 2014 at 14:04, Kevin Burton bur...@spinn3r.com wrote: I'm trying to figure out the best way to bootstrap our nodes. I *think* I want our nodes to be manually bootstrapped. This way an admin has to explicitly bring up the node in the cluster, and I don't have to worry about a script accidentally provisioning new nodes. The problem is HOW do you do it? I couldn't find any reference anywhere in the documentation. I *think* I run nodetool repair? but it's unclear... -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts http://spinn3r.com -- Ben Bromhead Instaclustr | www.instaclustr.com | @instaclustr http://twitter.com/instaclustr | +61 415 936 359 -- http://www.datastax.com/ Ryan Svihla Solution Architect https://twitter.com/foundev http://www.linkedin.com/pub/ryan-svihla/12/621/727/ DataStax is the fastest, most scalable distributed database technology, delivering Apache Cassandra to the world's most innovative enterprises.
Datastax is built to be agile, always-on, and predictably scalable to any size. With more than 500 customers in 45 countries, DataStax is the database technology and transactional backbone of choice for the world's most innovative companies such as Netflix, Adobe, Intuit, and eBay.
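Ryan's warning above - that a node started with auto_bootstrap=false serves empty answers until it has the data - can be sketched with a toy model (pure illustration, not Cassandra internals; node names and data are invented):

```python
# Toy model: a node that joins with auto_bootstrap=false owns token
# ranges but holds no data, so CL=ONE reads routed to it return nothing
# until a rebuild streams the data in.
replicas = {
    "n1": {"k": "v"},  # established replicas
    "n2": {"k": "v"},
    "n3": {},          # joined with auto_bootstrap=false, never streamed
}

def read_cl_one(node: str, key: str):
    """CL=ONE: the answer comes from a single replica, right or wrong."""
    return replicas[node].get(key)

assert read_cl_one("n1", "k") == "v"
assert read_cl_one("n3", "k") is None   # wrong (empty) answer served to clients

replicas["n3"].update(replicas["n1"])   # stand-in for `nodetool rebuild` streaming
assert read_cl_one("n3", "k") == "v"    # correct once streaming completes
```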
Re: full gc too often
This topic comes up quite a bit. Enough, in fact, that I've done a 1-hour webinar on the topic. I cover how the JVM GC works and things you need to consider when tuning it for Cassandra. https://www.youtube.com/watch?v=7B_w6YDYSwA With your specific problem - full GC not reducing the old gen - the most obvious answer is there's not much garbage to collect. Take a look at nodetool tpstats. Do you see lots of blocked MemtableFlushWriters? Jon On Thu Dec 18 2014 at 2:01:00 PM Y.Wong yungmw...@gmail.com wrote: On Dec 4, 2014 11:14 PM, Philo Yang ud1...@gmail.com wrote: Hi all, I have a cluster on C* 2.1.1 and JDK 1.7u51. I have a problem with full GC: sometimes one or two nodes will full GC more than once per minute, taking over 10 seconds each time; then the node becomes unreachable and the latency of the cluster increases. Grepping the GCInspector's log, I found that when a node is running fine without GC trouble there are two kinds of GC: ParNew GC in less than 300ms, which clears Par Eden Space and enlarges CMS Old Gen / Par Survivor Space a little (because it only logs GCs over 200ms, there is only a small number of ParNew GCs in the log), and ConcurrentMarkSweep in 4000~8000ms, which reduces CMS Old Gen a lot and enlarges Par Eden Space a little; it is executed once every 1-2 hours. However, sometimes ConcurrentMarkSweep behaves strangely:

INFO [Service Thread] 2014-12-05 11:28:44,629 GCInspector.java:142 - ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464; Par Eden Space: 503316480 -> 294794576; Par Survivor Space: 62914528 -> 0
INFO [Service Thread] 2014-12-05 11:28:59,581 GCInspector.java:142 - ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 -> 3579836512; Par Eden Space: 503316480 -> 310562032; Par Survivor Space: 62872496 -> 0
INFO [Service Thread] 2014-12-05 11:29:14,686 GCInspector.java:142 - ConcurrentMarkSweep GC in 11538ms. CMS Old Gen: 3579836688 -> 3579805792; Par Eden Space: 503316480 -> 332391096; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:29:29,371 GCInspector.java:142 - ConcurrentMarkSweep GC in 12180ms. CMS Old Gen: 3579835784 -> 3579829760; Par Eden Space: 503316480 -> 351991456; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:29:45,028 GCInspector.java:142 - ConcurrentMarkSweep GC in 10574ms. CMS Old Gen: 3579838112 -> 3579799752; Par Eden Space: 503316480 -> 366222584; Par Survivor Space: 62914560 -> 0
INFO [Service Thread] 2014-12-05 11:29:59,546 GCInspector.java:142 - ConcurrentMarkSweep GC in 11594ms. CMS Old Gen: 3579831424 -> 3579817392; Par Eden Space: 503316480 -> 388702928; Par Survivor Space: 62914552 -> 0
INFO [Service Thread] 2014-12-05 11:30:14,153 GCInspector.java:142 - ConcurrentMarkSweep GC in 11463ms. CMS Old Gen: 3579817392 -> 3579838424; Par Eden Space: 503316480 -> 408992784; Par Survivor Space: 62896720 -> 0
INFO [Service Thread] 2014-12-05 11:30:25,009 GCInspector.java:142 - ConcurrentMarkSweep GC in 9576ms. CMS Old Gen: 3579838424 -> 3579816424; Par Eden Space: 503316480 -> 438633608; Par Survivor Space: 62914544 -> 0
INFO [Service Thread] 2014-12-05 11:30:39,929 GCInspector.java:142 - ConcurrentMarkSweep GC in 11556ms. CMS Old Gen: 3579816424 -> 3579785496; Par Eden Space: 503316480 -> 441354856; Par Survivor Space: 62889528 -> 0
INFO [Service Thread] 2014-12-05 11:30:54,085 GCInspector.java:142 - ConcurrentMarkSweep GC in 12082ms. CMS Old Gen: 3579786592 -> 3579814464; Par Eden Space: 503316480 -> 448782440; Par Survivor Space: 62914560 -> 0

Each time, Old Gen is reduced only a little; Survivor Space is cleared, but the heap is still full, so another full GC follows very soon and then the node goes down. If I restart the node, it runs fine without GC trouble. Can anyone help me find out why full GC can't reduce CMS Old Gen? Is it because there are too many objects in the heap that can't be recycled? I think reviewing the table schema design and adding new nodes to the cluster is a good idea, but I still want to know if there is any other reason causing this trouble. Thanks, Philo Yang
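Jon's "not much garbage to collect" diagnosis is visible directly in Philo's log. A quick parse of a few of the Old Gen before/after values (the regex assumes the GCInspector layout shown above):

```python
# Compute how much CMS Old Gen each multi-second full GC actually reclaimed.
import re

log = """\
ConcurrentMarkSweep GC in 12648ms. CMS Old Gen: 3579838424 -> 3579838464;
ConcurrentMarkSweep GC in 12227ms. CMS Old Gen: 3579838464 -> 3579836512;
ConcurrentMarkSweep GC in 11538ms. CMS Old Gen: 3579836688 -> 3579805792;
"""

pat = re.compile(r"GC in (\d+)ms\. CMS Old Gen: (\d+) -> (\d+)")
matches = pat.findall(log)
reclaimed = [int(before) - int(after) for _, before, after in matches]
for (ms, _, _), r in zip(matches, reclaimed):
    print(f"{ms} ms pause reclaimed {r} bytes of old gen")
# Pauses of 11-12 seconds free at most ~30 KB of a ~3.3 GB old gen (the
# first pause even grew it), i.e. nearly everything in old gen is still
# live and the collector has nothing to reclaim.
```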
Re: simple data movement ?
It may be more valuable to set up your test cluster on the same version, and make sure your tokens are the same, then copy over your sstables. You'll have an exact replica of prod you can test your upgrade process against. On Fri Dec 19 2014 at 11:04:58 AM Ryan Svihla rsvi...@datastax.com wrote: In theory you could always do a data dump - sstable to JSON and back, for example - but you'd have to have your schema set up, and I've not actually done this myself, so YMMV. I've helped a bunch of folks with that upgrade path, and while it's time consuming it does work. On Fri, Dec 19, 2014 at 8:49 AM, Langston, Jim jim.langs...@dynatrace.com wrote: Thanks, this looks uglier. I double checked my production cluster (I have a staging and development cluster as well) and production is on 1.2.8. A copy of the data resulted in a message: Exception encountered during startup: Incompatible SSTable found. Current version ka is unable to read file: /cassandra/apache-cassandra-2.1.2/bin/../data/data/system/schema_keyspaces/system-schema_keyspaces-ic-150. Please run upgradesstables. Is the move going to be 1.2.8 -> 1.2.9 -> 2.0.x -> 2.1.2 ?? Can I just dump the data and import it into 2.1.2 ?? Jim From: Ryan Svihla rsvi...@datastax.com Reply-To: user@cassandra.apache.org Date: Thu, 18 Dec 2014 06:00:09 -0600 To: user@cassandra.apache.org Subject: Re: simple data movement ? I'm not sure that'll work with that many version moves in the middle; upgrades are, to my knowledge, only tested between specific steps, namely from 1.2.9 to the latest 2.0.x http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html Specifically: Cassandra 2.0.x restrictions http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html?scroll=concept_ds_yqj_5xr_ck__section_ubt_nwr_54 After downloading DataStax Community http://planetcassandra.org/cassandra/, upgrade to Cassandra directly from Cassandra 1.2.9 or later.
Cassandra 2.0 is not network- or SSTable-compatible with versions older than 1.2.9. If your version of Cassandra is earlier than 1.2.9 and you want to perform a rolling restart http://www.datastax.com/documentation/cassandra/1.2/cassandra/glossary/gloss_rolling_restart.html, first upgrade the entire cluster to 1.2.9, and then to Cassandra 2.0. Cassandra 2.1.x restrictions¶ http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html?scroll=concept_ds_yqj_5xr_ck__section_qzx_pwr_54 Upgrade to Cassandra 2.1 from Cassandra 2.0.7 or later. Cassandra 2.1 is not compatible with Cassandra 1.x SSTables. First upgrade the nodes to Cassandra 2.0.7 or later, start the cluster, upgrade the SSTables, stop the cluster, and then upgrade to Cassandra 2.1. On Wed, Dec 17, 2014 at 10:55 PM, Ben Bromhead b...@instaclustr.com wrote: Just copy the data directory from each prod node to your test node (and relevant configuration files etc). If your IP addresses are different between test and prod, follow https://engineering.eventbrite.com/changing-the-ip-address-of-a-cassandra-node-with-auto_bootstrapfalse/ On 18 December 2014 at 09:10, Langston, Jim jim.langs...@dynatrace.com wrote: Hi all, I have set up a test environment with C* 2.1.2, wanting to test our applications against it. I currently have C* 1.2.9 in production and want to use that data for testing. What would be a good approach for simply taking a copy of the production data and moving it into the test env and having the test env C* use that data ? The test env. is identical is size, with the difference being the versions of C*. Thanks, Jim The contents of this e-mail are intended for the named addressee only. It contains information that may be confidential. Unless you are the named addressee or an authorized designee, you may not copy or use it, or disclose it to anyone else. 
If you received it in error please notify us immediately and then destroy it -- Ben Bromhead Instaclustr | www.instaclustr.com | @instaclustr http://twitter.com/instaclustr | +61 415 936 359 -- Ryan Svihla Solution Architect, DataStax http://www.datastax.com/