Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "DataModelv2" page has been changed by ronaldmathies.
http://wiki.apache.org/cassandra/DataModelv2?action=diff&rev1=14&rev2=15

--------------------------------------------------

  = Introduction =
+ Cassandra has a data model that is far different from that of normal relational databases: instead of having schemas, tables and columns, the data
+ model consists of a structure of lists and maps.
+ 
+ Starting from the highest level, we have clusters. A cluster is a set of physical machines operating together to form a logical
+ Cassandra instance. A cluster can contain several keyspaces. A keyspace is very similar to a relational database schema and contains
+ a number of !ColumnFamilies. A !ColumnFamily can be compared to a table in a relational database, and it contains Columns.
+ 
+ A !ColumnFamily comes in two flavors. The first is the one already described: a !ColumnFamily that contains Columns. The second
+ is called a !SuperColumnFamily; it contains !SuperColumns, and each !SuperColumn contains a list of Columns. If this is
+ confusing, just read on; it will become clearer.
+ 
+ We'll start from the bottom up, moving from the leaves of Cassandra's data structure (columns) up to the root of the tree (the cluster).
+ 
  == Columns ==
+ 
+ A Column, also known as a tuple (triplet), contains a name, a value and a timestamp.
+ 
+ All values are supplied by the client, including the 'timestamp'. This means that clocks on the clients should be synchronized (synchronizing clocks in the Cassandra server environment is useful as well), as these timestamps are used for conflict resolution. In many cases the 'timestamp' is not used in client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, 'timestamps' will be elided for readability. It is also worth noting that the name and value are binary values, although in many applications they are UTF8 serialized strings.
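The role of the timestamp in conflict resolution can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cassandra's implementation: the `Column` class and the `resolve` function are hypothetical names, and the handling of equal timestamps is an arbitrary choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A column as described above: name, value and a client-supplied timestamp."""
    name: bytes
    value: bytes
    timestamp: int


def resolve(a: Column, b: Column) -> Column:
    """Pick the surviving version of a column: the higher timestamp wins.

    Equal timestamps keep `a`; that tie rule is an assumption of this sketch.
    """
    if a.name != b.name:
        raise ValueError("can only resolve two versions of the same column")
    return a if a.timestamp >= b.timestamp else b


# Two writes to the same column from differently timed clients.
older = Column(b"firstname", b"Ronald", 1270073054)
newer = Column(b"firstname", b"Ron", 1270084021)
winner = resolve(older, newer)
```

Whatever order the two writes arrive in, the version with the higher timestamp survives, which is why unsynchronized client clocks can let an earlier change overwrite a newer one.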
+ 
+ Timestamps can be anything you like, but microseconds since 1970 is a convention. Whatever you use, it must be consistent across the application; otherwise earlier changes may overwrite newer ones.
  
  Model representation:
@@ -28, +47 @@
  
  == ColumnFamily ==
+ A column family is a container for columns, analogous to a table in a relational system. You define column families in your storage-conf.xml file, and cannot modify them (or add new column families) without restarting your Cassandra process. A column family holds an ordered list of columns, which you can reference by column name.
+ 
+ Column families have a configurable ordering applied to the columns within each row, which affects the behavior of the get_slice call in the Thrift API. Out-of-the-box ordering implementations include ASCII, UTF-8, Long, and UUID (lexical or time).
+ 
+ An example configuration of an Authors !ColumnFamily using the UTF-8 sorting implementation would be:
+ 
+ <Keyspaces>
+   <Keyspace Name="Blog">
+     <!ColumnFamily !CompareWith="UTF8Type" Name="Authors"/>
+   </Keyspace>
+ </Keyspaces>
+ 
+ In Cassandra, each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order. Related columns, those that you'll access together, should be kept within the same column family.
+ 
+ The row key determines which machine the data is stored on. A key can be used for several column families at the same time; this does not, however, imply that the data in these column families is related. The semantics of having data for the same key in two different column families is entirely up to the client. The columns can also differ between the two column families. In fact, a virtually unlimited set of column names may be defined, which leads to fairly common use of the column name as a piece of runtime-populated data. This is unusual in storage systems, particularly if you're coming from the relational database world.
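As a rough mental model, a column family can be pictured as a map from row key to a name-sorted set of columns. The sketch below is an illustration only: `ColumnFamily`, `insert` and `slice_columns` are hypothetical names loosely inspired by the idea behind get_slice and are not the Thrift API, timestamps are elided, and Python's string ordering stands in for the UTF8Type comparator.

```python
from typing import Dict, List, Tuple


class ColumnFamily:
    """Toy in-memory model: row key -> {column name -> value}."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: Dict[str, Dict[str, str]] = {}

    def insert(self, key: str, column_name: str, value: str) -> None:
        # Column names are plain data, so each row may hold different columns.
        self.rows.setdefault(key, {})[column_name] = value

    def slice_columns(self, key: str, start: str, finish: str) -> List[Tuple[str, str]]:
        """Return the columns of one row whose names fall in [start, finish],
        ordered by column name."""
        row = self.rows.get(key, {})
        return [(name, row[name]) for name in sorted(row) if start <= name <= finish]


authors = ColumnFamily("Authors")
authors.insert("1", "firstname", "Ronald")
authors.insert("2", "lastname", "Steward")
authors.insert("2", "birthday", "01/01/1982")

row_two = authors.slice_columns("2", "a", "z")
```

Because columns come back in comparator order rather than insertion order, the choice of comparator decides what a name range means, which is why it affects slice queries.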
+ For each key you can have data from multiple column families associated with it. However, these are logically distinct, which is why the Thrift interface is oriented around accessing one !ColumnFamily per key at a time. On the other hand, a number of methods within the Thrift interface make use of this functionality: for example, batch_insert and batch_mutate make it possible to insert or modify data in multiple !ColumnFamilies at the same time, as long as the key for the different column families is the same.
+ 
  Model representation:
- ||<-2> '''Column Family''' ||
+ ||<-2> '''!ColumnFamily''' ||
  || '''key''' || '''list''' ||
  || binary || 1 .. * Columns ||
  
  Data representation:
- ||<-4> '''Column Family''' ||
+ ||<-4> '''!ColumnFamily''' ||
  || '''key''' ||<-3> '''Columns''' ||
  || 1 || '''name''' || '''value''' || '''timestamp''' ||
  || || "firstname" || "Ronald" || 1270073054 ||
@@ -47, +82 @@
  || || "lastname" || "Steward" || 1270084021 ||
  || || "birthday" || "01/01/1982" || 1270084021 ||
  
+ As you can see in this example, we have a !ColumnFamily containing two rows identified by the keys "1" and "2". Each row has a number of
+ columns; in this example each row has the columns firstname, lastname and birthday.
+ 
- == Super Column ==
+ == SuperColumn ==
+ 
+ A !SuperColumn is very similar to a !ColumnFamily: it consists of a key and a list of columns.
  
  Model representation:
- ||<-2> '''Super Column''' ||
+ ||<-2> '''!SuperColumn''' ||
  || '''key''' || '''list''' ||
  || binary || 1 .. * Columns ||
- TODO
- 
  Data representation:
- ||<-4> '''Super Column''' ||
+ ||<-4> '''!SuperColumn''' ||
  || '''key''' ||<-3> '''Columns''' ||
  || 1 || '''name''' || '''value''' || '''timestamp''' ||
  || || "firstname" || "Ronald" || 1270073054 ||
@@ -70, +108 @@
  || || "lastname" || "Steward" || 1270084021 ||
  || || "birthday" || "01/01/1982" || 1270084021 ||
  
+ As you can see, it looks the same as a !ColumnFamily; the only difference is the usage. A !SuperColumn is used within a !SuperColumnFamily, so
+ it adds an extra layer to your data structure: instead of a row that consists of a key and a list of columns, we can now have a row
+ that consists of a key and a list of super columns, each of which has its own key and a list of columns.
+ 
  == SuperColumnFamily ==
+ 
+ The !SuperColumnFamily isn't much different from a normal !ColumnFamily, except that it contains a list of super columns per row instead of
+ a list of columns. The following example defines a Posts super column family in your storage-conf.xml, using the UTF-8 sorting implementation:
+ 
+ <Keyspaces>
+   <Keyspace Name="Blog">
+     <!ColumnFamily !ColumnType="Super" !CompareWith="UTF8Type" !CompareSubcolumnsWith="UTF8Type" Name="Posts"/>
+   </Keyspace>
+ </Keyspaces>
+ 
+ The !ColumnType attribute tells Cassandra that the Posts column family is a super column family. The !CompareWith attribute defines the ordering of the super columns within a row, and the !CompareSubcolumnsWith attribute defines the ordering of the columns within each super column.
  
  Model representation:
- ||<-2> '''Column Family''' ||
+ ||<-2> '''!SuperColumnFamily''' ||
  || '''key''' || '''list''' ||
- || binary || 1 .. * Super Columns ||
+ || binary || 1 .. 
* !SuperColumns ||
  
  Data representation:
- ||<-5> '''Super Column Family''' ||
+ ||<-5> '''!SuperColumnFamily''' ||
- || '''Key''' ||<-4> '''Super Columns''' ||
+ || '''Key''' ||<-4> '''!SuperColumns''' ||
- || "my-new-guitar" || '''key''' ||<-3> '''Columns''' ||
+ || "my-new-guitar" || '''key''' ||<-3> '''Columns''' ||
- || || post || '''name''' || '''value''' || '''timestamp''' ||
+ || || post || '''name''' || '''value''' || '''timestamp''' ||
- || || || "subject" || "My new guitar" || 1270073054 ||
+ || || || "subject" || "My new guitar" || 1270073054 ||
- || || || "body" || "a lot of text" || 1270073054 ||
+ || || || "body" || "a lot of text" || 1270073054 ||
- || || || "created" || "01/01/2010" || 1270073054 ||
+ || || || "created" || "01/01/2010" || 1270073054 ||
- || || tags || '''name''' || '''value''' || '''timestamp''' ||
+ || || tags || '''name''' || '''value''' || '''timestamp''' ||
- || || || "tag0" || "Guitar" || 1270084021 ||
+ || || || "tag0" || "Guitar" || 1270084021 ||
- || || || "tag1" || "Instrument" || 1270084021 ||
+ || || || "tag1" || "Instrument" || 1270084021 ||
- || "guitar-lessons" || '''key''' ||<-3 #BBBBBB> Columns ||
+ || "guitar-lessons" || '''key''' ||<-3> '''Columns''' ||
- || || post ||<#AAAAAA> name ||<#AAAAAA> value ||<#AAAAAA> timestamp ||
+ || || post || '''name''' || '''value''' || '''timestamp''' ||
- || || || "subject" || "Guitar lessons" || 1270073054 ||
+ || || || "subject" || "Guitar lessons" || 1270073054 ||
- || || || "body" || "a lot of text" || 1270073054 ||
+ || || || "body" || "a lot of text" || 1270073054 ||
- || || || "created" || "01/03/2010" || 1270073054 ||
+ || || || "created" || "01/03/2010" || 1270073054 ||
- || || tags ||<#AAAAAA> name ||<#AAAAAA> value ||<#AAAAAA> timestamp ||
+ || || tags || '''name''' || '''value''' || '''timestamp''' ||
- || || || "tag0" || "Guitar" || 1270084021 ||
+ || || || "tag0" || "Guitar" || 1270084021 ||
- || || || "tag1" || "Instrument" || 1270084021 ||
+ || || || "tag1" || "Instrument" || 1270084021 ||
- || || || "tag2" || "Lessons" || 1270084021 ||
+ || || || "tag2" || "Lessons" || 1270084021 ||
  
- The basic concepts are:
+ This example shows two blog postings with the keys "my-new-guitar" and "guitar-lessons". Each row has two super columns, one for the post information and one for the tag information, and each super column has a number of columns. The post super column has the subject, body and created columns. The tags super column has the tag0 and tag1 columns, and sometimes a tag2 column.
  
-  * Cluster: the machines (nodes) in a logical Cassandra instance. Clusters can contain multiple keyspaces.
-  * Keyspace: a namespace for ColumnFamilies, typically one per application.
-  * ColumnFamilies contain multiple columns, each of which has a name, value, and a timestamp, and which are referenced by row keys.
-  * SuperColumns can be thought of as columns that themselves have subcolumns.
- We'll start from the bottom up, moving from the leaves of Cassandra's data structure (columns) up to the root of the tree (the cluster).
+ == Keyspaces ==
+ A keyspace is the first dimension of the Cassandra hash, and is the container for column families. Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world. They are the configuration and management point for column families, and are also the structure on which batch inserts are applied. In most cases you will have one keyspace per application.
+ 
+ == Modeling your application ==
+ 
+ Unlike with relational systems, where you model entities and relationships and then just add indexes to support whatever queries become necessary, with Cassandra you need to think ahead of time about which queries you want to support efficiently, and model accordingly. Since there are no automatically provided indexes, you will be much closer to one !ColumnFamily per query than you would have been with tables:queries relationally.
+ Don't be afraid to denormalize accordingly; Cassandra is much, much faster at writes than relational systems.
+ 
+ Arin Sarkissian of Digg has an excellent post detailing [[http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model|Cassandra's data model]] with highly illustrative examples.
+ Ronald Mathies has some nice postings on using Cassandra with Java; if you have read this page, you can start from part three: [[http://www.sodeso.nl/?p=207|Installing and using Apache Cassandra With Java Part 3 (Data model 2)]].
+ 
+ == Further reading ==
+ 
+ See the CassandraLimitations page for other things to keep in mind when designing a model.
+ See the [[http://wiki.apache.org/cassandra/API|Thrift API]] page to see what methods are available and how to use them.
+ See the StorageConfiguration page for more information on how to configure the storage-conf.xml file.
+ 
+ = Author =
+ 
+ [[http://www.sodeso.nl/?cat=10|Ronald Mathies]]
+ 
