[Cassandra Wiki] Update of "DataModel2" by CurtMicol

Apache Wiki Mon, 24 Aug 2009 12:07:43 -0700


The following page has been changed by CurtMicol:
http://wiki.apache.org/cassandra/DataModel2

The comment on the change is:
Cleaning up the JSON structures and some wording.

------------------------------------------------------------------------------
  = Introduction =
  
- Cassandra has a data model that can most easily be thought of as a four or 
five dimensional hash.  The basic concepts are a cluster, which can contain 
multiple keyspaces.  Each keyspace can contain multiple column families.  
Keyspaces contain multiple rows, which are referenced by keys.  These rows 
contain multiple columns, each of which has a value and a timestamp.  Super 
columns can be thought of as columns that have subcolumns. We'll start from the 
bottom up, moving from the leaves of Cassandra's data structure (columns) up to 
the root of the tree (the cluster).
+ Cassandra has a data model that can most easily be thought of as a four or 
five dimensional hash.
+ 
+ The basic concepts are:
+ * Cluster, which can contain multiple keyspaces.
+ * Keyspace, which can contain multiple column families.
+ * Keyspaces contain multiple rows, which are referenced by keys.
+ * Rows contain multiple columns, each of which has a value and a timestamp.
+ * Super columns can be thought of as columns that have subcolumns.
+ 
+ We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster).
  
  = Columns =
  
@@ -11, +20 @@

  Here's the thrift interface definition of a Column
  {{{
  struct Column {
-    1: binary                        name,
+   1: binary                        name,
-    2: binary                        value,
+   2: binary                        value,
-    3: i64                           timestamp,
+   3: i64                           timestamp,
  }
  }}}
  And here's a column represented in JSON-ish notation:
  {{{
- {  // this is a column
+ {
-     name: "emailAddress",
+   "name": "emailAddress",
-     value: "[email protected]",
+   "value": "[email protected]",
-     timestamp: 123456789
+   "timestamp": 123456789
  }
  }}}
  
+ All values are supplied by the client, including the 'timestamp'.  This means 
that clocks on the clients should be synchronized (synchronizing clocks in the 
Cassandra server environment is useful too), as these timestamps are used for 
conflict resolution.  In many cases the 'timestamp' is not used in client 
applications, and it becomes convenient to think of a column as a name/value 
pair. For the remainder of this document, timestamps will be elided for 
readability.  It is also worth noting that the name and value are binary 
values, although in many applications they are UTF-8 serialized strings.
- 
- All values are supplied by the client, including the timestamp.  This means 
that clocks in the Cassandra environment should be synchronized, as these 
timestamps are used for conflict resolution.  In many cases the timestamp is 
not used in client applications, and it becomes convenient to think of a column 
as a name/value pair. For the remainder of this document, timestamps will be 
elided for readability.  It is also worth noting the name and value are binary 
values, although in many applications they are UTF8 serialized strings.
  
  Timestamps can be anything you like, but milliseconds since 1970 is a 
convention, as returned by System.currentTimeMillis() in Java. Whatever you 
use, it must be consistent across the application; otherwise earlier changes 
may overwrite newer ones.
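
To make the role of the timestamp concrete, here is a minimal Python sketch of timestamp-based conflict resolution. It is an illustration only, not Cassandra's implementation: the `Column` tuple mirrors the three thrift fields above, while the `resolve` helper, its tie-break rule, and the example addresses are assumptions made for this sketch.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client, e.g. milliseconds since 1970


def resolve(a: Column, b: Column) -> Column:
    """Keep whichever version carries the larger client-supplied timestamp.

    Ties keep `a`; that tie-break is an assumption of this sketch.
    """
    return b if b.timestamp > a.timestamp else a


old = Column(b"emailAddress", b"old@example.com", 1000)
new = Column(b"emailAddress", b"new@example.com", 2000)
winner = resolve(old, new)

# A client whose clock runs slow writes later in real time,
# but its smaller timestamp means the write is discarded.
stale_clock = Column(b"emailAddress", b"latest@example.com", 1500)
after_skew = resolve(winner, stale_clock)
```

The second call shows why client clocks matter: the write from the slow clock is the most recent one in real time, yet it loses because its timestamp is smaller.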
  
  = Column Families =
+ 
- A column family is a container for columns.  You define column families in 
your storage-conf.xml file, and cannot modify them (or add new column families) 
without restarting your Cassandra process.  A column family holds an ordered 
list of columns, which you can reference by the column name.  A JSON 
representation would be
+ A column family is a container for columns.  You define column families in 
your storage-conf.xml file, and cannot modify them (or add new column families) 
without restarting your Cassandra process.  A column family holds an ordered 
list of columns, which you can reference by the column name.
+ 
+ A JSON representation of a column family would be:
  {{{
+ {
- { Users : {
+   "Users": {
+     "emailAddress": {"name": "emailAddress", "value": "[email protected]"},
+     "webSite": {"name": "webSite", "value": "http://bar.com"}
-   emailAddress : {  // this is a column
-     name: "emailAddress",
-     value: "[email protected]"
    }
+ }
+ }}}
  
+ In this example 'Users' is the column family and 'emailAddress' and 'webSite' 
are column names ('name' and 'value' are the fields stored in each column).
-   webSite : {  // this is a column
-     name: "webSite",
-     value: "http://bar.com";
-   }
- }}
- }}}
- Where "Users" is the column family, and "emailAddress" and "webSite" are 
columns.
  
  = Rows =
  
- A row-oriented database stores rows in a row-major fashion (i.e. all the 
columns in the row are kept together). A column-oriented database on the other 
hand stores data on a per-column basis. Column Families allow a hybrid 
approach. They allow you to break your row (the data corresponding to a key) 
into a static number of groups a.k.a Column Families. In Cassandra, each Column 
Family is stored in a separate file, and the file is sorted in row (i.e. key) 
major order. Related columns, those that you'll access together, should ideally 
be kept within the same column family for access efficiency. Column families 
have a configurable ordering applied to rows, which affects behavior of the 
get_key_range call in the thrift API.  Out of the box ordering implementations 
include ASCII, UTF-8, Long, and UUID (lexical or time).
+ A row-oriented database stores rows in a row-major fashion (i.e. all the 
columns in the row are kept together). A column-oriented database, on the other 
hand, stores data on a per-column basis. Cassandra's column families allow a 
hybrid approach. They allow you to break your row (the data corresponding to a 
key) into a static number of groups, a.k.a. column families. In Cassandra, each 
column family is stored in a separate file, and the file is sorted in row (i.e. 
key) major order. Related columns, those that you'll access together, should 
ideally be kept within the same column family for access efficiency.
+ 
+ Column families have a configurable ordering applied to rows, which affects 
behavior of the get_key_range call in the thrift API.  Out of the box ordering 
implementations include ASCII, UTF-8, Long, and UUID (lexical or time).
  
  A JSON representation of the row -> column family -> column structure is
  {{{
- { mccv : {Users : {
+ {
+    "mccv":{
+       "Users":{
-       emailAddress : {name: "emailAddress", value: "[email protected]"}
+          "emailAddress":{"name":"emailAddress", "value":"[email protected]"},
-       webSite : {  name: "webSite", value: "http://bar.com"}}
+          "webSite":{"name":"webSite", "value":"http://bar.com"}
+       },
-     Stats : {
+       "Stats":{
-       visits : {name: "visits", value: "243"}
+          "visits":{"name":"visits", "value":"243"}
+       }
+    },
+    "user2":{
+       "Users":{
+          "emailAddress":{"name":"emailAddress", "value":"[email protected]"},
+          "twitter":{"name":"twitter", "value":"user2"}
+       }
-     }
+    }
-   }
-   user2 : {Users : {
-     emailAddress : {name: "emailAddress", value: "[email protected]"}
-     twitter : {  name: "twitter", value: "user2"}}
-   }
  }
  }}}
- Note that the row mccv identifies data in two different column families 
(Users and Stats). This does not imply that data from these column families 
*must* be related.  The semantics of having data for the same key in two 
different column families is entirely up to the application.  Also note that 
within the Users column family, mccv and user2 have different column names 
defined.  This is perfectly valid in Cassandra.  In fact there may be a 
virtually unlimited set of column names defined, which leads to fairly common 
use of the column name as a piece of runtime populated data.  This is unusual 
in storage systems, particularly if you're coming from the RDBMS world.
+ Note that the row "mccv" identifies data in two different column families, 
"Users" and "Stats". This does not imply that data from these column families 
''must'' be related.  The semantics of having data for the same key in two 
different column families is entirely up to the application.  Also note that 
within the "Users" column family, "mccv" and "user2" have different column 
names defined.  This is perfectly valid in Cassandra.  In fact there may be a 
virtually unlimited set of column names defined, which leads to fairly common 
use of the column name as a piece of runtime populated data.  This is unusual 
in storage systems, particularly if you're coming from the RDBMS world.
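
The row -> column family -> column nesting can be sketched with plain Python dictionaries. This is only a mental model, not how Cassandra stores data. Timestamps are elided as in the rest of this page, each column is collapsed to its value, and the `column_names` helper and the example.com addresses are made up for this sketch.

```python
# key -> column family -> column name -> value
rows = {
    "mccv": {
        "Users": {"emailAddress": "mccv@example.com", "webSite": "http://bar.com"},
        "Stats": {"visits": "243"},
    },
    "user2": {
        "Users": {"emailAddress": "user2@example.com", "twitter": "user2"},
    },
}


def column_names(data, key, column_family):
    """Return the sorted column names a row defines in one column family."""
    return sorted(data.get(key, {}).get(column_family, {}))


mccv_users = column_names(rows, "mccv", "Users")
user2_users = column_names(rows, "user2", "Users")
user2_stats = column_names(rows, "user2", "Stats")  # user2 has no Stats data
```

The two rows share the "Users" column family but define different column names, and "user2" simply has nothing in "Stats"; no schema change is needed for either case.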
  
  = Keyspaces =
  
@@ -80, +94 @@

  
  So far we've covered "normal" column families.  Cassandra also supports super 
columns and super column families.  A super column family is a column family 
whose members are super columns.  A super column is just an associative array 
of columns.  Another way to think about this: a super column is structurally 
very similar to a column family, and a super column family is a column family 
that contains column families.
  
- A JSON description of this layout follows
+ A JSON description of this layout:
  {{{
+ {
- { mccv : {
+   "mccv": {
-     Tags : {
+     "Tags": {
-         cassandra : {
+       "cassandra": {
-             incubator : { incubator : 
"http://incubator.apache.org/cassandra/"},
+         "incubator": {"incubator": "http://incubator.apache.org/cassandra/"},
-             jira : { jira : "http://issues.apache.org/jira/browse/CASSANDRA"}
+         "jira": {"jira": "http://issues.apache.org/jira/browse/CASSANDRA"}
-         },
+       },
-         thrift : {
+       "thrift": {
-             jira : { jira : "http://issues.apache.org/jira/browse/THRIFT"}
+         "jira": {"jira": "http://issues.apache.org/jira/browse/THRIFT"}
-         }
+       }
      }  
+   }
  }
  }}}
  Here my super column family is "Tags".  I have two super columns defined 
here, "cassandra" and "thrift".  Within these I have specific named bookmarks, 
each of which is a column.
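
The extra level of nesting can be sketched in Python in the same spirit. Again this is a dictionary model for illustration, not Cassandra's storage format, and the `bookmarks` helper is a hypothetical name introduced here.

```python
# key -> super column family -> super column -> column name -> value
data = {
    "mccv": {
        "Tags": {
            "cassandra": {
                "incubator": "http://incubator.apache.org/cassandra/",
                "jira": "http://issues.apache.org/jira/browse/CASSANDRA",
            },
            "thrift": {
                "jira": "http://issues.apache.org/jira/browse/THRIFT",
            },
        }
    }
}


def bookmarks(store, key, super_column_family, super_column):
    """Return a copy of all columns stored under one super column."""
    return dict(store[key][super_column_family].get(super_column, {}))


cassandra_marks = bookmarks(data, "mccv", "Tags", "cassandra")
thrift_jira = data["mccv"]["Tags"]["thrift"]["jira"]
```

Note that the column name "jira" appears under both super columns without conflict, because each super column has its own set of columns.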
  
  == Example: SuperColumns for Search Apps ==
  
- You can think of each supercolumn name as a term and the columns within as 
the docids with rank info and other attributes being a part of it. If you have 
keys as the userids then you can have a per-user index stored in this form. 
This is how the per user index for term search is laid out for Inbox search at 
Facebook. Furthermore since one has the option of storing data on disk sorted 
by "Time" it is very easy for the system to answer queries of the form "Give me 
the top 10 messages". For a pictorial explanation please refer to the Cassandra 
powerpoint slides presented at SIGMOD 2008.
+ You can think of each super column name as a term, and the columns within it 
as the docids, with rank info and other attributes stored as part of each 
column. If the keys are userids, then you have a per-user index stored in this 
form. This is how the per-user index for term search is laid out for Inbox 
search at Facebook. Furthermore, since one has the option of storing data on 
disk sorted by "Time", it is very easy for the system to answer queries of the 
form "Give me the top 10 messages". For a pictorial explanation please refer to 
the Cassandra PowerPoint slides presented at SIGMOD 2008.
- 
  
  = Data Addressing =
  
- The Thrift API introduces the notion of column paths and column parents.  
These normalize to both super and normal super column families.  Conceptually a 
column parent always refers to a set of columns.  A column path always refers 
to a single column.  Thrift definitions for these structures are
+ The Thrift API introduces the notion of column paths and column parents.  
These normalize to both super and normal column families.  Conceptually a 
column parent always refers to a set of columns.  A column path always refers 
to a single column.
+ 
+ The thrift definitions for these structures are:
  {{{
  struct ColumnParent {
      3: string          column_family,
@@ -116, +133 @@

      5: optional binary column,
  }
  }}}
+ 
+ /* Edited up to here on 08/24. Will work on the rest soon. -asenchi */
+ 
  Suppose we define a table called !MyTable with column families 
!MySuperColumnFamily (this is a column family of type Super) and 
!MyColumnFamily (this is a simple column family). Any super column "SC" in 
!MySuperColumnFamily is addressed as "!MySuperColumnFamily:SC", and any column 
"C" within "SC" is addressed as 
  
  new ColumnPath("!MySuperColumnFamily","SC","C")
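
As a rough illustration of how such a path picks out a single column, here is a Python sketch that resolves a path against a dictionary model of one row. It is not the Thrift-generated class: the `ColumnPath` tuple here is a stand-in whose `super_column` field is assumed (the full struct is not shown above), and `get_column` and the stored values are hypothetical.

```python
from typing import NamedTuple, Optional


class ColumnPath(NamedTuple):
    column_family: str
    super_column: Optional[str]  # None for a simple column family (assumed field)
    column: str


def get_column(row, path):
    """Walk column family -> (optional) super column -> column."""
    container = row[path.column_family]
    if path.super_column is not None:
        container = container[path.super_column]
    return container[path.column]


# One row of the hypothetical table, modelled as nested dictionaries.
row = {
    "MySuperColumnFamily": {"SC": {"C": "value in super column"}},
    "MyColumnFamily": {"C": "value in simple column family"},
}

in_super = get_column(row, ColumnPath("MySuperColumnFamily", "SC", "C"))
in_simple = get_column(row, ColumnPath("MyColumnFamily", None, "C"))
```

The same three-part path covers both kinds of column family; for a simple column family the middle element is just left empty.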
