[Cassandra Wiki] Update of "DataModelv2" by ronaldmathies

Apache Wiki Sat, 03 Apr 2010 14:21:05 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "DataModelv2" page has been changed by ronaldmathies.
http://wiki.apache.org/cassandra/DataModelv2?action=diff&rev1=14&rev2=15

--------------------------------------------------

  
  = Introduction =
  
+ Cassandra has a data model that is far different from that of normal relational 
databases: instead of schemas, tables and columns, the data
+ model consists of a structure of lists and maps.
+ 
+ Starting from the highest level, we have clusters. A cluster is a group of 
physical machines operating together to form a logical 
+ Cassandra instance. A cluster can contain several keyspaces. A keyspace is 
very similar to a relational database schema and contains
+ a number of !ColumnFamilies. A !ColumnFamily can be compared to a table in a 
relational database, and it contains Columns.
+ 
+ A !ColumnFamily comes in two flavors. The first is the one already described: 
a !ColumnFamily that contains columns. The second
+ is called a !SuperColumnFamily; it contains !SuperColumns, and each 
!SuperColumn in turn contains a list of Columns. If this seems 
+ confusing, just read on; it will become clearer.
+ 
+ We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster).
+ 
  == Columns ==
+ 
+ A Column is a tuple (a triplet) containing a name, a value and a 
timestamp.
+ 
+ All values are supplied by the client, including the 'timestamp'. This means 
that clocks on the clients should be synchronized (synchronizing the clocks of 
the Cassandra servers is useful as well), as these timestamps are used for 
conflict resolution. In many cases the 'timestamp' is not used in client 
applications, and it becomes convenient to think of a column as a name/value 
pair. For the remainder of this document, 'timestamps' will be elided for 
readability. It is also worth noting that the name and value are binary values, 
although in many applications they are UTF-8 serialized strings.
+ 
+ Timestamps can be anything you like, but microseconds since 1970 is the 
convention. Whatever you use, it must be consistent across the application; 
otherwise earlier changes may overwrite newer ones.
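
The role of the client-supplied timestamp in conflict resolution can be sketched as follows. This is only an illustration of the "highest timestamp wins" idea described above, not Cassandra's actual code; the Column class, the resolve function and the handling of equal timestamps are assumptions made for the example.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical stand-in for a column: name, value and timestamp."""
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client, e.g. microseconds since 1970


def resolve(a: Column, b: Column) -> Column:
    """Return the version with the higher timestamp.

    Keeping `a` on a tie is an assumption of this sketch.
    """
    if a.name != b.name:
        raise ValueError("can only reconcile two versions of the same column")
    return b if b.timestamp > a.timestamp else a


old = Column(b"firstname", b"Ronald", 1270073054)
new = Column(b"firstname", b"Ron", 1270084021)
winner = resolve(old, new)
```

If one client used seconds and another microseconds, the microsecond value would always win regardless of which write happened last, which is why the unit has to be consistent across the application.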
  
  Model representation:
  
@@ -28, +47 @@

  
  == ColumnFamily ==
  
+ A column family is a container for columns, analogous to the table in a 
relational system. You define column families in your storage-conf.xml file, 
and cannot modify them (or add new column families) without restarting your 
Cassandra process. A column family holds an ordered list of columns, which you 
can reference by the column name.
+ 
+ Column families have a configurable ordering applied to the columns within 
each row, which affects the behavior of the get_slice call in the Thrift API. 
Out-of-the-box ordering implementations include ASCII, UTF-8, Long, and UUID 
(lexical or time).
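
To see why the choice of ordering matters, here is a small sketch that sorts the same three column names once as UTF-8 text and once as 8-byte longs. It only mimics the effect of a text comparator versus a numeric one; the variable names are made up for the example and none of this is Cassandra code.

```python
import struct

numbers = [1, 2, 10]

# Column names stored as UTF-8 text are compared byte by byte.
as_text = sorted(str(n).encode("utf-8") for n in numbers)

# Column names stored as 8-byte big-endian longs are compared numerically.
packed = [struct.pack(">q", n) for n in numbers]
as_long = sorted(packed, key=lambda raw: struct.unpack(">q", raw)[0])
decoded = [struct.unpack(">q", raw)[0] for raw in as_long]
```

With the text ordering "10" sorts before "2", so a slice from "1" to "2" would include the column named "10"; with the numeric ordering it would not.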
+ 
+ An example configuration of an Authors !ColumnFamily using the UTF-8 sorting 
implementation would be:
+ 
+ <Keyspaces>
+   <Keyspace Name="Blog">
+     <!ColumnFamily !CompareWith="UTF8Type" Name="Authors"/>
+   </Keyspace>
+ </Keyspaces>
+ 
+ In Cassandra, each column family is stored in a separate file, and the file 
is sorted in row (i.e. key) major order. Related columns, those that you'll 
access together, should be kept within the same column family.
+ 
+ The row key determines which machine the data is stored on. A key can be 
used for several column families at the same time; this does not, however, 
imply that the data from these column families is related. The semantics of 
having data for the same key in two different column families are entirely up 
to the client. Also, the columns can differ between the two column families. 
In fact, a virtually unlimited set of column names may be defined, which 
leads to fairly common use of the column name as a piece of runtime-populated 
data. This is unusual in storage systems, particularly if you're coming from 
the relational database world. For each key you can have data from multiple 
column families associated with it. However, these are logically distinct, 
which is why the Thrift interface is oriented around accessing one 
!ColumnFamily per key at a time. On the other hand, some methods in the Thrift 
interface do work across column families: batch_insert and batch_mutate, for 
example, make it possible to insert or modify data in multiple !ColumnFamilies 
at the same time, as long as the key is the same for all of them.
+ 
  Model representation:
  
- ||<-2> '''Column Family'''     ||
+ ||<-2> '''!ColumnFamily'''     ||
  || '''key''' || '''list'''     ||
  || binary    || 1 .. * Columns ||
  
  Data representation:
  
- ||<-4> '''Column Family'''                                     ||
+ ||<-4> '''!ColumnFamily'''                                     ||
  || '''key''' ||<-3>  '''Columns'''                             ||
  || 1         || '''name'''  || '''value'''  || '''timestamp''' ||
  ||           || "firstname" || "Ronald"     || 1270073054      ||
@@ -47, +82 @@

  ||           || "lastname"  || "Steward"    || 1270084021      ||
  ||           || "birthday"  || "01/01/1982" || 1270084021      ||
  
+ As you can see in this example, we have a !ColumnFamily containing two rows 
identified by the keys "1" and "2". Each row has a number of 
+ columns; in this example each row has the columns firstname, lastname and 
birthday.
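
One way to picture a !ColumnFamily is as a map from row key to a map of columns. The sketch below uses only the values visible in the table above (part of the table is cut off by the diff, so which row the last two columns belong to is an assumption), and the helper names get_column and get_slice_by_name are made up for the example rather than taken from the Thrift API.

```python
# row key -> column name -> (value, timestamp)
authors = {
    "1": {"firstname": ("Ronald", 1270073054)},
    "2": {
        "lastname": ("Steward", 1270084021),
        "birthday": ("01/01/1982", 1270084021),
    },
}


def get_column(column_family, key, name):
    """Look up a single column value by row key and column name."""
    value, _timestamp = column_family[key][name]
    return value


def get_slice_by_name(column_family, key, start, finish):
    """Return (name, value) pairs whose names fall in [start, finish], in name order."""
    row = column_family[key]
    return [(name, row[name][0]) for name in sorted(row) if start <= name <= finish]


first = get_column(authors, "1", "firstname")
early_columns = get_slice_by_name(authors, "2", "a", "m")
```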
+ 
- == Super Column ==
+ == SuperColumn ==
+ 
+ A !SuperColumn is very similar to a !ColumnFamily: it consists of a key and a 
list of columns.
  
  Model representation:
  
- ||<-2> '''Super Column'''      ||
+ ||<-2> '''!SuperColumn'''      ||
  || '''key''' || '''list'''     ||
  || binary    || 1 .. * Columns ||
  
- TODO
- 
  Data representation:
  
- ||<-4> '''Super Column'''                                      ||
+ ||<-4> '''!SuperColumn'''                                      ||
  || '''key''' ||<-3> '''Columns'''                              ||
  || 1         || '''name'''  || '''value'''  || '''timestamp''' ||
  ||           || "firstname" || "Ronald"     || 1270073054      ||
@@ -70, +108 @@

  ||           || "lastname"  || "Steward"    || 1270084021      ||
  ||           || "birthday"  || "01/01/1982" || 1270084021      ||
  
+ As you can see, it looks the same as a !ColumnFamily; the only difference is 
the usage. A !SuperColumn is used within a !SuperColumnFamily, so
+ it adds an extra layer to your data structure. Instead of a row 
that consists of a key and a list of columns, we can now have a row
+ that consists of a key and a list of super columns, each of which has its own 
key and a list of columns.
+ 
  == SuperColumnFamily ==
+ 
+ The !SuperColumnFamily isn't much different from a normal !ColumnFamily 
except that it contains a list of super columns per row instead of
+ a list of columns. The following example defines a Posts super column family 
in your storage-conf.xml, using the UTF-8 sorting implementation:
+ 
+ <Keyspaces>
+   <Keyspace Name="Blog">
+     <!ColumnFamily !ColumnType="Super" !CompareWith="UTF8Type" 
!CompareSubcolumnsWith="UTF8Type" Name="Posts"/>
+   </Keyspace>
+ </Keyspaces> 
+ 
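+ The !ColumnType attribute tells Cassandra that the Posts column family is a 
super column family. For a super column family, the !CompareWith attribute 
defines the sorting of the super columns themselves, and the 
!CompareSubcolumnsWith attribute defines the sorting of the columns within each 
super column.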
  
  Model representation:
  
- ||<-2> '''Column Family'''           ||
+ ||<-2> '''!SuperColumnFamily'''      ||
  || '''key''' || '''list'''           ||
- || binary    || 1 .. * Super Columns ||
+ || binary    || 1 .. * !SuperColumns ||
  
  Data representation:
  
- ||<-5> '''Super Column Family''' ||
+ ||<-5> '''!SuperColumnFamily''' ||
- || '''Key'''        ||<-4> '''Super Columns''' ||
+ || '''Key'''        ||<-4> '''!SuperColumns''' ||
- || "my-new-guitar"  || '''key''' ||<-3> '''Columns''' ||
+ || "my-new-guitar"  || '''key''' ||<-3> '''Columns''' ||
- ||                  || post      || '''name''' || '''value'''     || '''timestamp''' ||
+ ||                  || post      || '''name''' || '''value'''      || '''timestamp''' ||
- ||                  ||           || "subject"  || "My new guitar" || 1270073054      ||
+ ||                  ||           || "subject"  || "My new guitar"  || 1270073054      ||
- ||                  ||           || "body"     || "a lot of text" || 1270073054      ||
+ ||                  ||           || "body"     || "a lot of text"  || 1270073054      ||
- ||                  ||           || "created"  || "01/01/2010"    || 1270073054      ||
+ ||                  ||           || "created"  || "01/01/2010"     || 1270073054      ||
- ||                  || tags      || '''name''' || '''value'''     || '''timestamp''' ||
+ ||                  || tags      || '''name''' || '''value'''      || '''timestamp''' ||
- ||                  ||           || "tag0"     || "Guitar"        || 1270084021      ||
+ ||                  ||           || "tag0"     || "Guitar"         || 1270084021      ||
- ||                  ||           || "tag1"     || "Instrument"    || 1270084021      ||
+ ||                  ||           || "tag1"     || "Instrument"     || 1270084021      ||
- || "guitar-lessons" || '''key''' ||<-3 #BBBBBB>  Columns ||
+ || "guitar-lessons" || '''key''' ||<-3>  '''Columns''' ||
- ||                  || post      ||<#AAAAAA> name        ||<#AAAAAA> value            ||<#AAAAAA> timestamp  ||
+ ||                  || post      || '''name''' || '''value'''      || '''timestamp''' ||
- ||                  ||           ||          "subject"   ||          "Guitar lessons" ||          1270073054 ||
+ ||                  ||           || "subject"  || "Guitar lessons" || 1270073054      ||
- ||                  ||           ||          "body"      ||          "a lot of text"  ||          1270073054 ||
+ ||                  ||           || "body"     || "a lot of text"  || 1270073054      ||
- ||                  ||           ||          "created"   ||          "01/03/2010"     ||          1270073054 ||
+ ||                  ||           || "created"  || "01/03/2010"     || 1270073054      ||
- ||                  || tags      ||<#AAAAAA> name        ||<#AAAAAA> value            ||<#AAAAAA> timestamp  ||
+ ||                  || tags      || '''name''' || '''value'''      || '''timestamp''' ||
- ||                  ||           ||          "tag0"      ||          "Guitar"         ||          1270084021 ||
+ ||                  ||           || "tag0"     || "Guitar"         || 1270084021      ||
- ||                  ||           ||          "tag1"      ||          "Instrument"     ||          1270084021 ||
+ ||                  ||           || "tag1"     || "Instrument"     || 1270084021      ||
- ||                  ||           ||          "tag2"      ||          "Lessons"        ||          1270084021 ||
+ ||                  ||           || "tag2"     || "Lessons"        || 1270084021      ||
- The basic concepts are:
  
+ This example shows two blog postings with the keys 
"my-new-guitar" and "guitar-lessons". Each row has two super columns, one 
for the post information and one for the tag information, and each super 
column has a number of columns. The post super column has the subject, body 
and created columns. The tags super column has the tag0 and tag1 columns, and 
sometimes a tag2 column.
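
The same data can be pictured as a map nested one level deeper than a plain !ColumnFamily: row key, then super column key, then column name. The sketch below copies the values from the table above with timestamps elided; the tags_for helper is a made-up name for the example, not part of any Cassandra API.

```python
# row key -> super column key -> column name -> value (timestamps elided)
posts = {
    "my-new-guitar": {
        "post": {
            "subject": "My new guitar",
            "body": "a lot of text",
            "created": "01/01/2010",
        },
        "tags": {"tag0": "Guitar", "tag1": "Instrument"},
    },
    "guitar-lessons": {
        "post": {
            "subject": "Guitar lessons",
            "body": "a lot of text",
            "created": "01/03/2010",
        },
        "tags": {"tag0": "Guitar", "tag1": "Instrument", "tag2": "Lessons"},
    },
}


def tags_for(super_column_family, key):
    """Return the tag values of one posting, ordered by column name."""
    tags = super_column_family[key]["tags"]
    return [tags[name] for name in sorted(tags)]


lesson_tags = tags_for(posts, "guitar-lessons")
```

Because the column names inside the tags super column are just data, one posting can carry two tags and another three without any change to the configuration.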
-     * Cluster: the machines (nodes) in a logical Cassandra instance. Clusters 
can contain multiple keyspaces.
-     * Keyspace: a namespace for ColumnFamilies, typically one per application.
-     * ColumnFamilies contain multiple columns, each of which has a name, 
value, and a timestamp, and which are referenced by row keys.
-     * SuperColumns can be thought of as columns that themselves have 
subcolumns. 
  
- We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster). 
+ == Keyspaces ==
  
+ A keyspace is the first dimension of the Cassandra hash, and is the container 
for column families. Keyspaces are of roughly the same granularity as a schema 
or database (i.e. a logical collection of tables) in the RDBMS world. They are 
the configuration and management point for column families, and are also the 
structure on which batch inserts are applied. In most cases you will have one 
keyspace per application.
+ 
+ == Modeling your application ==
+ 
+ Unlike with relational systems, where you model entities and relationships 
and then just add indexes to support whatever queries become necessary, with 
Cassandra you need to think ahead of time about which queries you want to 
support efficiently, and model accordingly.  Since there are no 
automatically-provided indexes, you will end up much closer to one 
!ColumnFamily per query than you would be to one table per query in a 
relational design.  Don't be afraid to denormalize accordingly; Cassandra is 
much, much faster at writes than relational systems.
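
A minimal sketch of what "one !ColumnFamily per query" can look like, using plain dictionaries in place of column families. The names posts_by_slug, posts_by_tag and save_post are hypothetical and chosen only for this illustration; the point is that each write is duplicated into every structure that a query will later read from.

```python
from collections import defaultdict

# One structure per query we want to answer efficiently.
posts_by_slug = {}                # query: fetch a posting by its key
posts_by_tag = defaultdict(dict)  # query: list the postings carrying a tag


def save_post(slug, subject, tags):
    """Write the posting once per query structure (denormalized)."""
    posts_by_slug[slug] = {"subject": subject}
    for tag in tags:
        # The column name is runtime data: the slug of the posting.
        posts_by_tag[tag][slug] = subject


save_post("my-new-guitar", "My new guitar", ["Guitar", "Instrument"])
save_post("guitar-lessons", "Guitar lessons", ["Guitar", "Instrument", "Lessons"])

guitar_posts = sorted(posts_by_tag["Guitar"])
```

The extra writes are the price of denormalizing; the lookup by tag then becomes a single row read instead of a join.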
+ 
+ Arin Sarkissian of Digg has an excellent post detailing 
[[http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model|Cassandra's 
data model]] with highly illustrative examples.
+ Ronald Mathies has some nice postings on using Cassandra with Java. If you 
have read this page, you can start from part three: 
[[http://www.sodeso.nl/?p=207|Installing and using Apache Cassandra With Java 
Part 3 (Data model 2)]].
+ 
+ == Further reading ==
+ 
+ See the CassandraLimitations page for other things to keep in mind when 
designing a model.
+ See the [[http://wiki.apache.org/cassandra/API|Thrift API]] page to see what 
methods are available and how to use them.
+ See the StorageConfiguration page for more information on how to configure 
the storage-conf.xml file.
+ 
+ = Author =
+ 
+ [[http://www.sodeso.nl/?cat=10|Ronald Mathies]]
+
