[Cassandra Wiki] Update of "DataModel" by JonathanEllis

Apache Wiki Mon, 31 Aug 2009 14:03:52 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The following page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/DataModel

The comment on the change is:
paste in DM2 content

------------------------------------------------------------------------------
  = Introduction =
  
- Basic unit of access control within Cassandra is a Column Family. A table in 
Cassandra is made up of one or many column families. A row in a table is 
uniquely identified using a unique key. The key is a string and can be of any 
size. The number of column families and the name of each column family must 
currently be fixed at the time the cluster is started. There is no limitation 
on the number of column families but it is expected that there would be 
relatively few of these. A column family can be of one of two types: Simple or 
Super. Columns within both of these are dynamically created and there is no 
limit on the number of these. Columns are constructs that are uniquely 
identified by a name, a value and a user-defined time stamp. The number of 
columns that can be contained in a column family could be very large. This can 
also vary per key. For instance key K1 could have 1024 columns/supercolumns 
while key K2 could have 64 columns/supercolumns. SuperColumns are constructs 
that have a name and an infinite number of columns associated with them. The 
number of supercolumns associated with any column family may be very large. 
They exhibit the same characteristics as columns. The columns can be sorted by 
name or time and this can be explicitly expressed via the configuration file, 
for any given column family.
+ Cassandra has a data model that can most easily be thought of as a four or 
five dimensional hash.
  
- The main limitation on column and supercolumn size is that all data for a 
single key and column must fit (on disk) on a single machine in the cluster.  
Because keys alone are used to determine the nodes responsible for replicating 
their data, the amount of data associated with a single key has this upper 
bound.  This is an inherent limitation of the distribution model.
+ The basic concepts are:
+  * Cluster: the machines (nodes) in a logical Cassandra instance.  Clusters 
can contain multiple keyspaces.
+  * Keyspace: a namespace for !ColumnFamilies, typically one per application.
+  * !ColumnFamilies contain multiple columns, each of which has a name, value, 
and a timestamp, and which are referenced by row keys.
+  * !SuperColumns can be thought of as columns that themselves have subcolumns.
  
- Currently Cassandra also has the limitation that in the worst case, data for 
a key / ColumnFamily pair will all be deserialized into memory for a read 
request.  (But never for writes!)  This will be fixed in a future release.
+ We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster).
  
- = More Detail =
+ = Columns =
  
- A row-oriented database stores rows in a row-major fashion (i.e. all the 
columns in the row are kept together). A column-oriented database on the other 
hand stores data on a per-column basis. Column Families allow a hybrid 
approach. They allow you to break your row (the data corresponding to a key) 
into a static number of groups a.k.a Column Families. In Cassandra, each Column 
Family in a table is stored in a separate file, and the file is sorted in row 
(i.e. key) major order. Related columns, those that you'll access together, 
should ideally be kept within the same column family for access efficiency. 
Furthermore, columns in a column family can be sorted and stored on disk either 
in time sorted order or in name sorted order. SuperColumns, on the other hand, 
are always sorted by name. Columns within a super column may be sorted by time.
+ The column is the lowest/smallest increment of data. It's a tuple (triplet) 
that contains a name, a value and a timestamp.
  
- Suppose we define a table called !MyTable with column families 
!MySuperColumnFamily (this is a column family of type Super) and 
!MySimpleColumnFamily (this is a simple column family). Any super column SC in 
!MySuperColumnFamily is addressed as "!MySuperColumnFamily:SC" and any column 
C within SC is addressed as "!MySuperColumnFamily:SC:C". Any column C within 
!MySimpleColumnFamily is addressed as "!MySimpleColumnFamily:C". In short, ":" 
is a reserved character and should not be used as part of a Column Family name 
or as part of the name for a Super Column or Column.  (We plan to address this 
limitation for the 0.4 release.)
+ Here's the Thrift interface definition of a Column:
+ {{{
+ struct Column {
+   1: binary                        name,
+   2: binary                        value,
+   3: i64                           timestamp,
+ }
+ }}}
+ And here's a column represented in JSON-ish notation:
+ {{{
+ {
+   "name": "emailAddress",
+   "value": "[email protected]",
+   "timestamp": 123456789
+ }
+ }}}
+ 
+ All values are supplied by the client, including the 'timestamp'.  This means 
that clocks on the clients should be synchronized (synchronizing the clocks on 
the Cassandra servers is useful too), as these timestamps are used for conflict 
resolution.  In many cases the 'timestamp' is not used in client applications, 
and it becomes convenient to think of a column as a name/value pair. For the 
remainder of this document, timestamps will be elided for readability.  It is 
also worth noting that the name and value are binary values, although in many 
applications they are UTF-8 serialized strings.
+ 
+ Timestamps can be anything you like, but milliseconds since 1970 is a 
convention, as returned by System.currentTimeMillis() in Java. Whatever you 
use, it must be consistent across the application; otherwise earlier changes 
may overwrite newer ones.
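+ 
+ To make the conflict-resolution point concrete, here is a minimal Python sketch of "largest timestamp wins". It is an illustration only, not Cassandra's implementation: the `Column` tuple mirrors the Thrift struct above, while `now_millis`, `resolve`, and the tie-breaking rule are assumptions made for the example.

```python
import time
from typing import NamedTuple


class Column(NamedTuple):
    """Mirrors the Thrift struct: binary name, binary value, i64 timestamp."""
    name: bytes
    value: bytes
    timestamp: int


def now_millis() -> int:
    """Milliseconds since 1970, the convention mentioned above."""
    return int(time.time() * 1000)


def resolve(current: Column, incoming: Column) -> Column:
    """Keep the version with the larger timestamp.

    Keeping `current` on equal timestamps is an assumption of this sketch.
    """
    return incoming if incoming.timestamp > current.timestamp else current


first = Column(b"emailAddress", b"old@example.com", 2000)
# A client with a slow clock writes later but stamps an earlier time.
second = Column(b"emailAddress", b"new@example.com", 1500)
winner = resolve(first, second)  # the later write is discarded
```

+ Here the second write loses even though it happened later, which is why the timestamp source needs to be consistent across all clients.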
+ 
+ = Column Families =
+ 
+ A column family is a container for columns, analogous to the table in a 
relational system.  You define column families in your storage-conf.xml file, 
and cannot modify them (or add new column families) without restarting your 
Cassandra process.  A column family holds an ordered list of columns, which you 
can reference by the column name.
+ 
+ Column families have a configurable ordering applied to the columns within 
each row, which affects the behavior of the get_slice call in the thrift API.  
Out of the box ordering implementations include ASCII, UTF-8, Long, and UUID 
(lexical or time).
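+ 
+ As a rough illustration of why the configured ordering matters for slices, consider the Python sketch below. `slice_columns` is a hypothetical helper, not the Thrift get_slice call; the inclusive bounds and the use of `int` as a stand-in for a numeric (Long-style) ordering are assumptions made for the example.

```python
def slice_columns(row, start, finish, key=lambda name: name):
    """Return (name, value) pairs with start <= name <= finish.

    `row` maps column names to values; `key` plays the role of the
    column family's configured ordering (hypothetical helper).
    """
    ordered = sorted(row, key=key)
    return [(n, row[n]) for n in ordered if key(start) <= key(n) <= key(finish)]


row = {b"10": b"x", b"9": b"y", b"100": b"z"}

# Byte-wise ordering: b"10" < b"100" < b"9"
bytewise = slice_columns(row, b"1", b"2")

# Numeric ordering: 9 < 10 < 100
numeric = slice_columns(row, b"9", b"10", key=int)
```

+ The same column names yield different slices depending on the ordering, so the ordering should be chosen with the intended slice queries in mind.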
+ 
+ = Rows =
+ 
+ In Cassandra, each column family is stored in a separate file, and the file 
is sorted in row (i.e. key) major order. Related columns, those that you'll 
access together, should be kept within the same column family.
+ 
+ The row key determines which machine the data is stored on.  Thus, for 
each key you can have data from multiple column families associated with it.  
However, these are logically distinct, which is why the Thrift interface is 
oriented around accessing one !ColumnFamily per key at a time.  (TODO given 
this, is the following JSON more confusing than helpful?)
+ 
+ A JSON representation of the key -> column families -> column structure is
+ {{{
+ {
+    "mccv":{
+       "Users":{
+          "emailAddress":{"name":"emailAddress", "value":"[email protected]"},
+          "webSite":{"name":"webSite", "value":"http://bar.com"}
+       },
+       "Stats":{
+          "visits":{"name":"visits", "value":"243"}
+       }
+    },
+    "user2":{
+       "Users":{
+          "emailAddress":{"name":"emailAddress", "value":"[email protected]"},
+          "twitter":{"name":"twitter", "value":"user2"}
+       }
+    }
+ }
+ }}}
+ 
+ Note that the key "mccv" identifies data in two different column families, 
"Users" and "Stats". This does not imply that data from these column families 
is related.  The semantics of having data for the same key in two different 
column families is entirely up to the application.  Also note that within the 
"Users" column family, "mccv" and "user2" have different column names defined.  
This is perfectly valid in Cassandra.  In fact there may be a virtually 
unlimited set of column names defined, which leads to fairly common use of the 
column name as a piece of runtime populated data.  This is unusual in storage 
systems, particularly if you're coming from the RDBMS world.
+ 
+ = Keyspaces =
+ 
+ A keyspace is the first dimension of the Cassandra hash, and is the container 
for column families. Keyspaces are of roughly the same granularity as a schema 
or database (i.e. a logical collection of tables) in the RDBMS world.  They are 
the configuration and management point for column families, and are also the 
structure on which batch inserts are applied.
+ 
+ = Super Columns =
+ 
+ So far we've covered "normal" columns and rows.  Cassandra also supports 
super columns: columns whose values are themselves columns; that is, a super 
column is a (sorted) associative array of columns.
+ 
+ A JSON description of this layout:
+ {{{
+ {
+   "mccv": {
+     "Tags": {
+       "cassandra": {
+         "incubator": {"incubator": "http://incubator.apache.org/cassandra/"},
+         "jira": {"jira": "http://issues.apache.org/jira/browse/CASSANDRA"}
+       },
+       "thrift": {
+         "jira": {"jira": "http://issues.apache.org/jira/browse/THRIFT"}
+       }
+     }  
+   }
+ }
+ }}}
+ Here my column family is "Tags".  I have two super columns defined here, 
"cassandra" and "thrift".  Within these I have specific named bookmarks, each 
of which is a column.
+ 
+ Just like normal columns, super columns are sparse: each row may contain as 
many or as few as it likes; Cassandra imposes no restrictions.
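+ 
+ The "four or five dimensional hash" view can be sketched with nested Python dicts. This is a toy model under stated assumptions: `insert` is a hypothetical helper, the keyspace level is omitted, each column is collapsed to name -> value, and none of it reflects how Cassandra stores data on disk.

```python
def insert(store, key, column_family, path, value):
    """Set a value at key -> column family -> [super column ->] column.

    `path` is (column,) for a standard column family and
    (super_column, column) for a super column family (hypothetical helper).
    """
    node = store.setdefault(key, {}).setdefault(column_family, {})
    for name in path[:-1]:
        node = node.setdefault(name, {})
    node[path[-1]] = value


store = {}
# Standard column family: three levels below the keyspace.
insert(store, "mccv", "Users", ("webSite",), "http://bar.com")
# Super column family: one extra level for the super column name.
insert(store, "mccv", "Tags", ("cassandra", "jira"),
       "http://issues.apache.org/jira/browse/CASSANDRA")
insert(store, "mccv", "Tags", ("thrift", "jira"),
       "http://issues.apache.org/jira/browse/THRIFT")
```

+ Rows and super columns only hold the names that were actually inserted, which is the sparseness described above.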
  
  = Range queries =
  
- Cassandra supports pluggable partitioning schemes with a relatively small 
amount of code.  Out of the box, Cassandra provides the hash-based 
RandomPartitioner and an OrderPreservingPartitioner.  RandomPartitioner gives 
you pretty good load balancing with no further work required.  
OrderPreservingPartitioner on the other hand lets you perform range queries on 
the keys you have stored.  Systems that only support hash-based partitioning 
cannot perform range queries efficiently.
+ Cassandra supports pluggable partitioning schemes with a relatively small 
amount of code.  Out of the box, Cassandra provides the hash-based 
RandomPartitioner and an OrderPreservingPartitioner.  RandomPartitioner gives 
you pretty good load balancing with no further work required.  
OrderPreservingPartitioner on the other hand lets you perform range queries on 
the keys you have stored, but requires choosing node tokens carefully or active 
load balancing.  Systems that only support hash-based partitioning cannot 
perform range queries efficiently.
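+ 
+ The difference between the two partitioners can be sketched in Python as follows. This is a simplified illustration, not Cassandra's code: `random_token` uses an MD5 digest as a stand-in for a hash-based token, `order_preserving_token` uses the key itself, and both function names are hypothetical.

```python
import hashlib
from bisect import bisect_left


def random_token(key: str) -> int:
    """Hash-based token: spreads keys evenly but destroys key order."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def order_preserving_token(key: str) -> str:
    """Order-preserving token: the key itself, so key order is kept."""
    return key


keys = ["dave", "alice", "carol", "bob"]

by_key = sorted(keys, key=order_preserving_token)
by_hash = sorted(keys, key=random_token)

# With key order preserved, a range query is one contiguous run of tokens.
lo, hi = bisect_left(by_key, "b"), bisect_left(by_key, "d")
in_range = by_key[lo:hi]
```

+ Under the hash-based ordering the keys between "b" and "d" are not guaranteed to be adjacent, so a range query would have to consult every node; under the order-preserving one they form a single run, at the cost of uneven load when keys cluster.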
  
- = Example: SuperColumns for Search Apps =
+ = Modeling your application =
  
- You can think of each supercolumn name as a term and the columns within as 
the docids with rank info and other attributes being a part of it. If you have 
keys as the userids then you can have a per-user index stored in this form. 
This is how the per user index for term search is laid out for Inbox search at 
Facebook. Furthermore since one has the option of storing data on disk sorted 
by "Time" it is very easy for the system to answer queries of the form "Give me 
the top 10 messages". For a pictorial explanation please refer to the Cassandra 
powerpoint slides presented at SIGMOD 2008.
+ Unlike with relational systems, where you model entities and relationships 
and then just add indexes to support whatever queries become necessary, with 
Cassandra you need to think about what queries you want to support efficiently 
ahead of time, and model appropriately.  Since there are no 
automatically-provided indexes, you will end up much closer to one 
!ColumnFamily per query than you would to one table per query in a relational 
design.  Don't be afraid to denormalize accordingly; Cassandra is much, much 
faster at writes than relational systems.
  
+ See the CassandraLimitations page for other things to keep in mind when 
designing a model.
+ 
+ == Example: SuperColumns for Search Apps ==
+ 
+ You can think of each super column name as a term and the columns within it 
as the docids, with rank info and other attributes stored alongside them. If 
you use userids as keys, you can store a per-user index in this form. This is 
how the per-user index for term search is laid out for Inbox search at 
Facebook. Furthermore, since one has the option of storing data on disk sorted 
by "Time", it is very easy for the system to answer queries of the form "Give 
me the 10 most recent messages". For a pictorial explanation please refer to 
the Cassandra PowerPoint slides presented at SIGMOD 2008.
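+ 
+ The layout described above can be sketched with nested Python dicts: key = userid, super column = term, column = docid. This is only an illustration of the shape of the data, not Facebook's or Cassandra's implementation; `index_message`, `most_recent`, and the choice of storing a timestamp as the column value are assumptions made for the example.

```python
def index_message(index, user_id, term, doc_id, timestamp):
    """Record that `doc_id` contains `term` in the given user's index."""
    index.setdefault(user_id, {}).setdefault(term, {})[doc_id] = timestamp


def most_recent(index, user_id, term, limit=10):
    """Return up to `limit` docids for a term, newest first."""
    columns = index.get(user_id, {}).get(term, {})
    return sorted(columns, key=columns.get, reverse=True)[:limit]


index = {}
index_message(index, "user1", "cassandra", "msg-1", 100)
index_message(index, "user1", "cassandra", "msg-2", 300)
index_message(index, "user1", "cassandra", "msg-3", 200)
index_message(index, "user1", "thrift", "msg-2", 300)

latest = most_recent(index, "user1", "cassandra", limit=2)
```

+ In this toy version the sort happens at query time; the point of time-sorted storage is that the store can return the newest columns without sorting.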
+ 
+ == Example: multiuser blog ==
+ 
+ TODO
+ 
+ = Thrift API =
+ 
+ Moved to ["API"].
+ 
+ = Attribution =
+ Thanks to phatduckk and asenchi for coming up with examples, text, and 
reviewing concepts.
+
