[Cassandra Wiki] Update of "DataModel2" by JonathanEllis

Apache Wiki Mon, 31 Aug 2009 10:48:26 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The following page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/DataModel2

The comment on the change is:
clean up fundamentals

------------------------------------------------------------------------------
  Cassandra has a data model that can most easily be thought of as a four or 
five dimensional hash.
  
  The basic concepts are:
+  * Cluster: the machines (nodes) in a logical Cassandra instance.  Clusters 
can contain multiple keyspaces.
+  * Keyspace: a namespace for Column Families, typically one per application.
-  * Cluster, which can contain multiple keyspaces.
-  * Keyspace, which can contain multiple column families.
-  * Keyspaces contain multiple rows, which are referenced by keys.
-  * Column familes contain multiple columns, each of which has a value and a 
timestamp.
+  * Column Families contain multiple columns, each of which has a name, a value, 
and a timestamp, and which are referenced by row keys.
-  * Super columns can be thought of as columns that have subcolumns.
+  * Super Columns can be thought of as columns that themselves have subcolumns.
  
  We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster).
  
@@ -40, +39 @@

  
  = Column Families =
  
- A column family is a container for columns.  You define column families in 
your storage-conf.xml file, and cannot modify them (or add new column families) 
without restarting your Cassandra process.  A column family holds an ordered 
list of columns, which you can reference by the column name.
+ A column family is a container for columns, analogous to the table in a 
relational system.  You define column families in your storage-conf.xml file, 
and cannot modify them (or add new column families) without restarting your 
Cassandra process.  A column family holds an ordered list of columns, which you 
can reference by the column name.
  
+ Column families have a configurable ordering applied to the columns within 
each row, which affects the behavior of the get_slice call in the Thrift API.  
The ordering implementations available out of the box include ASCII, UTF-8, 
Long, and UUID (lexical or time).
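
As a rough illustration of why the ordering matters for slices, the following Python sketch sorts the same column names under two orderings and takes the first two columns. The helpers `ascii_order`, `long_order` and `slice_columns` are hypothetical names invented for this sketch; they are not part of the Thrift API and only mimic the idea of a per-column-family ordering.

```python
# Minimal sketch: how a column family's ordering changes what a slice returns.
# ascii_order, long_order and slice_columns are hypothetical helpers, not Cassandra APIs.

def ascii_order(name: str) -> bytes:
    # Compare names byte by byte, as a lexical (ASCII) ordering would.
    return name.encode("ascii")

def long_order(name: str) -> int:
    # Interpret names as integers, as a Long ordering would.
    return int(name)

def slice_columns(row: dict, order, count: int) -> list:
    # Return the first `count` column names of the row under the given ordering.
    return sorted(row, key=order)[:count]

row = {"10": "c", "9": "b", "100": "a"}

ascii_slice = slice_columns(row, ascii_order, 2)  # lexical: "10" < "100" < "9"
long_slice = slice_columns(row, long_order, 2)    # numeric: 9 < 10 < 100
```

The same row yields different "first two columns" depending on the ordering, which is why the ordering has to be chosen with the intended slice queries in mind.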
- A JSON representation of a column family would be:
- {{{
- {
-   "Users": {
-     "emailAddress": {"name": "emailAddress", "value": "[email protected]"},
-     "webSite": {"name": "webSite", "value": "http://bar.com"}
-   }
- }
- }}}
- 
- In this example 'Users' is the column family and 'emailAddress' and 'webSite' 
are column names ('name' and 'value' are the actual columns).
  
  = Rows =
  
- A row-oriented database stores rows in a row-major fashion (i.e. all the 
columns in the row are kept together). A column-oriented database (such as 
Cassandra) stores data on a per-column basis. Column families allow a hybrid 
approach. They allow you to break your row (the data corresponding to a key) 
into a static number of groups a.k.a column families. In Cassandra, each column 
family is stored in a separate file, and the file is sorted in row (i.e. key) 
major order. Related columns, those that you'll access together, should ideally 
be kept within the same column family for access efficiency.
+ In Cassandra, each column family is stored in a separate file, and the file 
is sorted in row (i.e. key) major order. Related columns, those that you'll 
access together, should be kept within the same column family.
  
- Column families have a configurable ordering applied to rows, which affects 
behavior of the get_key_range call in the thrift API.  Out of the box ordering 
implementations include ASCII, UTF-8, Long, and UUID (lexical or time).
+ The row key determines which machine the data is stored on.  For each key 
you can have data from multiple column families associated with it.  
However, these are logically distinct, which is why the Thrift interface is 
oriented around accessing one ColumnFamily per key at a time.  (TODO given 
this, is the following JSON more confusing than helpful?)
  
- A JSON representation of the row -> column family -> column structure is
+ A JSON representation of the key -> column families -> column structure is
  {{{
  {
     "mccv":{
@@ -80, +69 @@

     }
  }
  }}}
+ 
- Note that the row "mccv" identifies data in two different column families, 
"Users" and "Stats". This does not imply that data from these column families 
''must'' be related.  The semantics of having data for the same key in two 
different column families is entirely up to the application.  Also note that 
within the "Users" column family, "mccv" and "user2" have different column 
names defined.  This is perfectly valid in Cassandra.  In fact there may be a 
virtually unlimited set of column names defined, which leads to fairly common 
use of the column name as a piece of runtime populated data.  This is unusual 
in storage systems, particularly if you're coming from the RDBMS world.
+ Note that the key "mccv" identifies data in two different column families, 
"Users" and "Stats". This does not imply that data from these column families 
is related.  The semantics of having data for the same key in two different 
column families is entirely up to the application.  Also note that within the 
"Users" column family, "mccv" and "user2" have different column names defined.  
This is perfectly valid in Cassandra.  In fact there may be a virtually 
unlimited set of column names defined, which leads to fairly common use of the 
column name as a piece of runtime populated data.  This is unusual in storage 
systems, particularly if you're coming from the RDBMS world.
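
The key -> column families -> column structure can be sketched with nested Python dicts. This is only a mental model, not how Cassandra stores data; the column names and values below are made up for illustration, and `get_columns` is a hypothetical helper that mirrors the "one column family per key at a time" style of access.

```python
# Sketch of the key -> column families -> columns layout as nested dicts.
# Column names and values are invented for illustration.
data = {
    "mccv": {
        "Users": {"emailAddress": "mccv@example.com", "webSite": "http://example.com"},
        "Stats": {"visits": "243"},
    },
    "user2": {
        "Users": {"emailAddress": "user2@example.com", "twitter": "user2"},
    },
}

def get_columns(key: str, column_family: str) -> dict:
    # Hypothetical helper: fetch one column family for one key at a time.
    return data.get(key, {}).get(column_family, {})

# Rows in the same column family need not share column names.
mccv_names = set(get_columns("mccv", "Users"))
user2_names = set(get_columns("user2", "Users"))
only_mccv = mccv_names - user2_names

# Column names can carry runtime data, e.g. one column per event timestamp.
data["user2"].setdefault("Stats", {})["2009-08-31T10:48"] = "login"
```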
  
  = Keyspaces =
  
  A keyspace is the first dimension of the Cassandra hash, and is the container 
for column families. Keyspaces are of roughly the same granularity as a schema 
or database (i.e. a logical collection of tables) in the RDBMS world.  They are 
the configuration and management point for column families, and are also the 
structure on which batch inserts are applied.
  
- = Cluster =
- 
- A cluster is a collection of one or more keyspaces.  Cassandra server 
processes belong to a specific cluster.
- 
  = Super Columns =
  
- So far we've covered "normal" column families.  Cassandra also supports super 
columns and super column families.  A super column family is a column family 
whose members are super columns.  A super column is just an associative array 
of columns.  Another way to think about this... a super column is structurally 
very similar to a column family, and a super column family is a column family 
that contains column families.  
+ So far we've covered "normal" columns and rows.  Cassandra also supports 
super columns: columns whose values are themselves columns; that is, a super 
column is a (sorted) associative array of columns.
  
  A JSON description of this layout:
  {{{
@@ -110, +96 @@

    }
  }
  }}}
- Here my super column family is "Tags".  I have two super columns defined 
here, "cassandra" and "thrift".  Within these I have specific named bookmarks, 
each of which is a column.
+ Here my column family is "Tags".  I have two super columns defined here, 
"cassandra" and "thrift".  Within these I have specific named bookmarks, each 
of which is a column.
+ 
+ Just like normal columns, super columns are sparse: each row may contain as 
many or as few as it likes; Cassandra imposes no restrictions.
  
  == Example: SuperColumns for Search Apps ==
  
- You can think of each super column name as a term and the columns within as 
the docids with rank info and other attributes being a part of it. If you have 
keys as the userids then you can have a per-user index stored in this form. 
This is how the per user index for term search is laid out for Inbox search at 
Facebook. Furthermore since one has the option of storing data on disk sorted 
by "Time" it is very easy for the system to answer queries of the form "Give me 
the top 10 messages". For a pictorial explanation please refer to the Cassandra 
powerpoint slides presented at SIGMOD 2008.
+ You can think of each super column name as a term and the columns within it 
as the docids, with rank info and other attributes stored as part of each. If 
the row keys are the userids, then you have a per-user index stored in this 
form. This is how the per-user index for term search is laid out for Inbox 
search at Facebook. Furthermore, since one has the option of storing data on 
disk sorted by "Time", it is very easy for the system to answer queries of the 
form "Give me the 10 most recent messages". For a pictorial explanation please 
refer to the Cassandra PowerPoint slides presented at SIGMOD 2008.
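
A minimal Python sketch of such a per-user index, assuming a layout of row key (userid) -> super column (term) -> columns (docids). The user id, docids, and the `rank`/`ts` attributes are hypothetical, and `most_recent` is an invented helper. The sketch sorts at query time for simplicity, whereas the text above describes data already stored on disk in time order.

```python
# Sketch: per-user search index as row key -> super column (term) -> columns (docids).
# All identifiers and the rank/ts values are hypothetical.
index = {
    "user42": {                      # row key: user id
        "cassandra": {               # super column name: search term
            "msg-3": {"rank": 0.9, "ts": 300},
            "msg-1": {"rank": 0.4, "ts": 100},
        },
        "thrift": {
            "msg-2": {"rank": 0.7, "ts": 200},
        },
    },
}

def most_recent(user: str, term: str, limit: int) -> list:
    # Return up to `limit` docids for the term, newest first.
    docs = index.get(user, {}).get(term, {})
    return sorted(docs, key=lambda d: docs[d]["ts"], reverse=True)[:limit]

recent = most_recent("user42", "cassandra", 10)
```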
  
  = Data Addressing =
  
@@ -149, +137 @@

  = Slice queries =
  == Slice Predicates ==
  == ColumnOrSuperColumn ==
+ 
  = Range queries =
  
- Cassandra supports pluggable partitioning schemes with a relatively small 
amount of code.  Out of the box, Cassandra provides the hash-based 
RandomPartitioner and an OrderPreservingPartitioner.  RandomPartitioner gives 
you pretty good load balancing with no further work required.  
OrderPreservingPartitioner on the other hand lets you perform range queries on 
the keys you have stored.  Systems that only support hash-based partitioning 
cannot perform range queries efficiently.
+ Cassandra supports pluggable partitioning schemes with a relatively small 
amount of code.  Out of the box, Cassandra provides the hash-based 
RandomPartitioner and an OrderPreservingPartitioner.  RandomPartitioner gives 
you pretty good load balancing with no further work required.  
OrderPreservingPartitioner on the other hand lets you perform range queries on 
the keys you have stored, but requires choosing node tokens carefully or active 
load balancing.  Systems that only support hash-based partitioning cannot 
perform range queries efficiently.
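
The difference between the two partitioning styles can be sketched in Python. This is not Cassandra's actual code: `hash_token`, `order_token` and `key_range` are hypothetical helpers, and MD5 is used here only as a stand-in for whatever hash a hash-based partitioner applies.

```python
import bisect
import hashlib

# Sketch: hash-based vs order-preserving token assignment; not Cassandra's code.

def hash_token(key: str) -> int:
    # Assumption: MD5 stands in for the hash of a hash-based partitioner.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def order_token(key: str) -> str:
    # An order-preserving scheme uses the key itself as the token.
    return key

def key_range(sorted_tokens: list, start, end) -> list:
    # Contiguous scan over tokens in [start, end]; this matches a key range
    # only when token order is the same as key order.
    lo = bisect.bisect_left(sorted_tokens, start)
    hi = bisect.bisect_right(sorted_tokens, end)
    return sorted_tokens[lo:hi]

keys = ["alice", "bob", "carol", "dave"]

ordered = sorted(order_token(k) for k in keys)
in_range = key_range(ordered, "b", "d")  # keys k with "b" <= k <= "d"

# Hash tokens spread keys across the token space, which helps load balancing
# but means neighbouring keys are no longer neighbouring tokens.
hashed = sorted(hash_token(k) for k in keys)
```

With order-preserving tokens a key range is one contiguous scan; with hashed tokens the same keys are scattered, so a range query would have to touch the whole token space.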
  
  = Consistency Level =
