Cassandra data model misconceptions, and their sources
Ok, here are the common Cassandra misconceptions, and their sources, gleaned from experience and talking to various people. Not listed in any particular order.

1. A key is global, and data in different column families must be related.
   - BigTable paper
   - key precedence in Thrift API
2. Table is like a row-oriented table
   - the name
   - somewhat fixed by changing to keyspace
3. Keyspace is not like a database (in SQL/CouchDB/MongoDB)
   - because it's not called that
4. Columns are literally columnar
   - the name
   - column sets are stored per key, not per column family (unlike relational DBs)
   - column name as a piece of data is unusual (esp. in relational DBs)
5. Columns are versioned
   - BigTable paper
6. Super columns are magical
   - Name has no precedent anywhere
   - Super columns do not have timestamps, unlike columns
   - Other MVAs are not fully recursive; just have values
7. Difference between column family, column, and super column is not clear
   - Everything has column in the name
   - super, family, and are not well-understood
8. Cassandra uses Paxos
   - BigTable paper
9. Cassandra uses client-side conflict resolution
   - Dynamo paper

A lot of things to get wrong, right off the bat. Maybe this makes it clear why the BigTable references were not helpful to us? For a new user, it provides as many wrong assumptions as correct assumptions.

Evan

--
Evan Weaver
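For readers who find code easier than prose, here is a minimal sketch of the model that the list above implies, written as plain Python rather than against any Cassandra API. It assumes the usual reading of items 1, 4, 5, 6, and 9: keys are scoped to a column family, each key has its own set of columns, a column is a name/value/timestamp triple, a super column is a timestamp-less container, and the newest timestamp wins without the client being asked. The class names mirror the terms in the list; the keyspace contents ("Users", "Tweets", the addresses) are made up for illustration, and tie-breaking on equal timestamps is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    """A (name, value, timestamp) triple; the name is itself a piece of data."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """A named container of columns; note that it carries no timestamp."""
    name: str
    columns: dict = field(default_factory=dict)


class ColumnFamily:
    """Maps row key -> {column name -> Column}; each key has its own column set."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, key, column):
        row = self.rows.setdefault(key, {})
        current = row.get(column.name)
        # Keep only the newest write: no version history is retained, and the
        # store (not the client) decides which write wins.
        if current is None or column.timestamp > current.timestamp:
            row[column.name] = column

    def get(self, key):
        return self.rows.get(key, {})


# Hypothetical keyspace with two column families (names are illustrative).
keyspace = {"Users": ColumnFamily("Users"), "Tweets": ColumnFamily("Tweets")}
users, tweets = keyspace["Users"], keyspace["Tweets"]

users.insert("evan", Column("email", "old@example.com", 1))
users.insert("evan", Column("email", "new@example.com", 2))
users.insert("evan", Column("email", "stale@example.com", 1))  # older write is dropped
users.insert("mark", Column("location", "SF", 1))  # a different column set per key
tweets.insert("evan", Column("body", "hello", 1))  # same key, unrelated data

print(users.get("evan")["email"].value)  # new@example.com
print(sorted(users.get("mark")))         # ['location']
print(sorted(tweets.get("evan")))        # ['body']
```

The key "evan" appears in both column families without implying any relationship between the two rows, and "evan" and "mark" carry different column names inside the same family, which is the opposite of what a relational table would allow.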
Re: [VOTE] Release cassandra 0.4.0-beta1
+1 from me.

On Aug 18, 2009, at 2:12 AM, ant elder wrote:

On Mon, Aug 17, 2009 at 4:55 PM, Eric Evans eev...@rackspace.com wrote:

On Mon, 2009-08-17 at 13:00 +0100, sebb wrote:

Given what's being said in the Thrift release legal issues thread I think it should be ok to have the 3rd party licenses separate,

I disagree. It must be possible to find all the LICENSE files starting at the initial LICENSE file. At the very least, the initial LICENSE file should have pointers to the other license files.

the NOTICE file looks acceptable to me too.

AIUI, the NOTICE file needs to give attributions to all 3rd party code included in the proposed release.

When preparing for the 0.3.0[0] release I spent a great deal of time trying to get all of this right. I looked at list threads for both successful and failed podling release votes, I looked at what top-level projects were doing, and I read through what documentation I could find. This wasn't as helpful as I'd have liked because the documents are non-normative and the application is inconsistent (and occasionally contradictory). So I did the best I could.

The conclusion I came to with respect to NOTICE.txt was that it existed for purposes of attribution, and was specifically in response to section 4(d) of the Apache License. As a result, the NOTICE.txt in the (approved) 0.3.0 artifacts and the proposed 0.4.0 contains two attribution statements, one for the Apache licensed Groovy, and one for software developed by The Apache Software Foundation, which should cover everything else that is Apache licensed.

The conclusion I came to for LICENSE.txt was that it was for including the full license text applicable to the project itself.

Both of the above conclusions seemed consistent with at least some successful podling releases, and with some ASF top-level projects, and (to the best of my knowledge) all of the license requirements for our third-party dependencies are being met.
However, I'd be happy to go back and correct any shortcomings and re-roll the artifacts if that will get us the votes we need to make a release. I just wish things were more consistent and that the process required a little less groping around in the dark.

[0] http://www.mail-archive.com/gene...@incubator.apache.org/msg21853.html

--
Eric Evans
eev...@rackspace.com

Ok, +1 from me to release. Happy to reconsider if anyone can find an actual link to some evidence of specific policy that says how this release is done is not ok.

...ant

--
Ian Holsman
i...@holsman.net
Re: Cassandra data model misconceptions, and their sources
I find the diagrams of Evan and folks (http://blog.evanweaver.com/files/cassandra/twitter.jpg) much easier to grok than any particular naming scheme. Annotating that diagram with specific implementations or constraints, like your wiki page, is a great addition.

.. Adam

On Mon, Aug 17, 2009 at 3:32 PM, Mark McBride mark.mcbr...@gmail.com wrote:

My first attempt at a revamped data model wiki page is up here: http://wiki.apache.org/cassandra/DataModel2

This one follows phatduckk's approach of describing the data model bottom up, which I found more intuitive. I'm interested to hear if

1) It corrects some of the misconceptions people have run into
2) The bottom up approach is more approachable than top down.
3) I got everything covered and everything right :)

---Mark

On Mon, Aug 17, 2009 at 11:31 AM, Edward Ribeiro edward.ribe...@gmail.com wrote:

Right on target, Evan! When I first downloaded Cassandra, three months ago, I tried to make the analogy with BigTable, whose paper I'd already read, but the differences between Cassandra and BigTable made it quite hard to grasp some Cassandra concepts. Imho, as Table was renamed to Keyspace, Column should be the next concept to be renamed, as shown by numbers 4, 5, 6, and 7 of your list. I would suggest renaming Column to Attribute (with the corresponding AttributeFamily or AttributeSet). It's not the best name, but right off the bat it is what I can suggest.

Edward
Re: Cassandra data model misconceptions, and their sources
I've been thinking about this for a number of days, and again, while I am not a developer I thought I might toss in a proposal if that's okay. Since putting together a schema diagram and having a number of people review it, I think a change is warranted. Too many people are coming from the RDBMS world and the terms used by Cassandra are conflicting with those terms they are already familiar with.

The TLDR version is as follows:

Object (Column)
ObjectFamily (ColumnFamily)
Directory (Row)
ObjectContainer (SuperColumn)
Namespace (Keyspace)

The long version...

Object (Column)

As Evan has stated repeatedly, column is a bit misleading, especially when compared to other types of database systems. I think this is probably the most important change to the data model names, and exactly where I started since this is the 'core' of Cassandra. Object gives the impression that this is a piece of data; it's relatively structured, but the name gives no impression of how strict that structure is. 'Objects' have names that have values and timestamps. Simple and to the point. 'Object' doesn't come with the preconceived notions that 'column' comes with and leaves room for Cassandra to define what an 'object' is without any conflict with preexisting data structures. By changing this, we can move up the ladder to other data types and easily rename them to something that 'contains objects' or 'accesses objects'. This allows us to describe the data model in the name structure without having to get too deep into the definition.

Directory (Row)

'row' is currently unnamed, but still a structure that exists in the model. It's not specifically data itself, but more of a mapping of how to get to objects (using keys). 'Directory' fills this void quite well. It is easily explained as a path to get to data and not data itself.

ObjectFamily (ColumnFamily)

There's no argument that the one direct link to the BigTable paper is 'column families'.
It's perhaps the only structure that is virtually the same in both pieces of software. Considering this, I think we need to avoid too drastic a change. With that said, I think a change is necessary due to the differences in columns between the two databases. 'object family' is descriptive of the relation between objects and removes any reference to tabular structures while keeping a loose relationship to 'column family' in the BigTable paper.

ObjectContainer (SuperColumn)

I could see this being shortened to 'container' in everyday conversation. However, 'objectcontainer' fits nicely with the rest of the data model names and is descriptive of its purpose and use. Ultimately a 'supercolumn' is nothing more than a named container of columns (and I've seen on at least 3 different occasions the word container used to describe supercolumns). 'supercolumn' had no real connection to what exactly it was defining, but with 'object container' we have a clear understanding that we are naming the structure that holds objects. Or as I explained it to a friend, we are naming the 'jar' and not the 'honey'. :)

Namespace (Keyspace)

This one I go back and forth on. I know it's been changed from 'Table' to 'keyspace' and Evan proposed 'database', but I think that 'namespace' is really what it is we are talking about. Wikipedia has this as the first line to describe 'namespace':

A namespace is an abstract container or environment created to hold a logical grouping of unique identifiers or symbols (i.e., names).

Originally I thought 'objectspace' would fit better, but I think 'namespace' comes with a better history and is clearer about what this structure really is, especially when you relate the name namespace to how it is used in Ruby, Python and Java. Ultimately though, I think I prefer 'keyspace' over 'table' or 'database'.

The only issue I see with all of these names is the potential conflict with programming languages and their objects.
I know next to nothing about Java so I don't know if there would be a conflict here. I've run the following Google search, 'reserved words in *' where '*' is Ruby, Python, Java and C++, and received no mention of 'object' being a reserved word in any of those languages. I also grep'd through the current source code and there don't seem to be any real conflicts that couldn't be renamed so as not to conflict with this naming structure.

In the end, I think it's a good idea to look at this and work out a solution. Documentation and tutorials are going to help, but I think people are so entrenched in the RDBMS world that there is somewhat of a barrier to understanding Cassandra's data model.

Thanks for your time,

--
# Curt Micol
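The reserved-word check described above can be repeated mechanically for at least one of the languages mentioned. The following is a small sketch for Python only, using the standard library's `keyword` module; the list of candidate names is copied from the proposal, and the lowercased variants are an added assumption, on the guess that client code might spell them that way.

```python
import builtins
import keyword

# Names proposed in the message above, plus lowercased variants (an assumption
# about how they might appear as identifiers in client code).
proposed = ["Object", "ObjectFamily", "Directory", "ObjectContainer", "Namespace"]
candidates = proposed + [name.lower() for name in proposed]

# keyword.iskeyword reports whether a string is a reserved word in Python.
reserved = [name for name in candidates if keyword.iskeyword(name)]
print(reserved)  # []

# `object` is a built-in name rather than a keyword: legal as an identifier,
# although using it would shadow the built-in type.
print(hasattr(builtins, "object"))  # True
```

This says nothing about the other languages in the search. Lowercase `namespace`, for instance, is a keyword in C++, so that particular spelling would need care in a C++ client even though 'object' is not reserved there.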