[Cassandra Wiki] Update of "DataModel2" by JonathanEllis

Apache Wiki Mon, 31 Aug 2009 11:26:46 -0700

The following page has been changed by JonathanEllis:
http://wiki.apache.org/cassandra/DataModel2

The comment on the change is:
thrift stuff goes on API page

------------------------------------------------------------------------------
  
  The basic concepts are:
   * Cluster: the machines (nodes) in a logical Cassandra instance.  Clusters 
can contain multiple keyspaces.
-  * Keyspace: a namespace for Column Families, typically one per application.
+  * Keyspace: a namespace for !ColumnFamilies, typically one per application.
-  * Column Familes contain multiple columns, each of which has a name, value, 
and a timestamp, and which are referenced by row keys.
+  * !ColumnFamilies contain multiple columns, each of which has a name, value, 
and a timestamp, and which are referenced by row keys.
-  * Super Columns can be thought of as columns that themselves have subcolumns.
+  * !SuperColumns can be thought of as columns that themselves have subcolumns.
  
  We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster).
  
@@ -47, +47 @@

  
  In Cassandra, each column family is stored in a separate file, and the file 
is sorted in row (i.e. key) major order. Related columns, those that you'll 
access together, should be kept within the same column family.
  
- The row key is what determines what machine data is stored on.  Thus, for 
each key you can have data from multiple column families associated with it.  
However, these are logically distinct, which is why the Thrift interface is 
oriented around accessing one ColumnFamily per key at a time.  (TODO given 
this, is the following JSON more confusing than helpful?)
+ The row key is what determines what machine data is stored on.  Thus, for 
each key you can have data from multiple column families associated with it.  
However, these are logically distinct, which is why the Thrift interface is 
oriented around accessing one !ColumnFamily per key at a time.  (TODO given 
this, is the following JSON more confusing than helpful?)
  
  A JSON representation of the key -> column families -> column structure is
  {{{
@@ -100, +100 @@

  
  Just like normal columns, super columns are sparse: each row may contain as 
many or as few as it likes; Cassandra imposes no restrictions.
  
+ = Range queries =
+ 
+ Cassandra supports pluggable partitioning schemes with a relatively small 
amount of code.  Out of the box, Cassandra provides the hash-based 
RandomPartitioner and an OrderPreservingPartitioner.  RandomPartitioner gives 
you pretty good load balancing with no further work required.  
OrderPreservingPartitioner on the other hand lets you perform range queries on 
the keys you have stored, but requires choosing node tokens carefully or active 
load balancing.  Systems that only support hash-based partitioning cannot 
perform range queries efficiently.
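
The trade-off between the two partitioners can be sketched in a few lines of Python. This is only an illustration, not Cassandra's implementation: the function names are hypothetical, the use of MD5 for the hash-based token is an assumption, and a sorted list stands in for the ring of nodes.

```python
import bisect
import hashlib

def random_token(key: str) -> int:
    """Hash-based token in the spirit of RandomPartitioner (MD5 is an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def order_preserving_token(key: str) -> str:
    """Order-preserving token: the key itself, so key order equals storage order."""
    return key

def range_query(sorted_keys, start, end):
    """Return keys in [start, end] from a list sorted by key, using binary search."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]

def scan_query(keys, start, end):
    """With hash ordering there is no contiguous slice; every key must be checked."""
    return sorted(k for k in keys if start <= k <= end)

keys = ["alice", "bob", "carol", "dave", "erin"]
by_key = sorted(keys, key=order_preserving_token)   # contiguous key ranges
by_hash = sorted(keys, key=random_token)            # ranges scattered by the hash

print(range_query(by_key, "bob", "dave"))
print(scan_query(by_hash, "bob", "dave"))
```

Both calls return the same keys, but only the order-preserving layout can answer with a contiguous slice; the hash-ordered layout has to examine every key.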
+ 
+ = Modeling your application =
+ 
+ In a relational system you model entities and relationships first and then 
add indexes to support whatever queries become necessary.  With Cassandra you 
need to decide ahead of time which queries you want to support efficiently, 
and model accordingly.  Since there are no automatically-provided indexes, you 
will end up much closer to one !ColumnFamily per query than to the ratio of 
tables to queries you would have in a relational schema.  Don't be afraid to 
denormalize accordingly; Cassandra is much, much faster at writes than 
relational systems.
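
A minimal sketch of the "one column family per query" idea, using plain Python dictionaries in place of column families. All names here (`entries_by_user`, `entries_by_tag`, `post_entry`) are hypothetical and chosen only for illustration; nothing below is Cassandra API.

```python
from collections import defaultdict

# Each stand-in column family maps row key -> {column name: value}.
entries_by_user = defaultdict(dict)   # serves "all entries written by this user"
entries_by_tag = defaultdict(dict)    # serves "all entries carrying this tag"

def post_entry(user, tag, timestamp, body):
    """Denormalize on write: store the entry once per query it must serve."""
    entries_by_user[user][timestamp] = body
    entries_by_tag[tag][timestamp] = body

def entries_for_tag(tag):
    """Read one row from one column family; no join or secondary index needed."""
    row = entries_by_tag[tag]
    return [row[ts] for ts in sorted(row)]

post_entry("alice", "cassandra", 1, "hello")
post_entry("bob", "cassandra", 2, "range queries")

print(entries_for_tag("cassandra"))
```

Each write costs one extra insert, which is the price paid so that every read touches exactly one row in one column family.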
+ 
  == Example: SuperColumns for Search Apps ==
  
  You can think of each super column name as a term and the columns within as 
the docids, with rank info and other attributes stored alongside them. If you 
use userids as keys, then you can have a per-user index stored in this form. 
This is how the per-user index for term search is laid out for Inbox search at 
Facebook. Furthermore, since one has the option of storing data on disk sorted 
by "Time", it is very easy for the system to answer queries of the form "Give 
me the 10 most recent messages". For a pictorial explanation please refer to 
the Cassandra PowerPoint slides presented at SIGMOD 2008.
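
The layout described above can be pictured with nested dictionaries. This is a rough sketch only: the user id, terms, docids, and rank values are made up, and the `search` helper is hypothetical rather than part of any Cassandra interface.

```python
# row key (userid) -> super column (term) -> column (docid) -> value (rank info)
inbox_index = {
    "user42": {
        "cassandra": {"doc7": 0.9, "doc3": 0.4},
        "thrift": {"doc3": 0.7},
    },
}

def search(user, term):
    """Return the docids for one term in one user's index, best rank first."""
    columns = inbox_index.get(user, {}).get(term, {})
    return sorted(columns, key=columns.get, reverse=True)

print(search("user42", "cassandra"))
print(search("user42", "missing"))
```

Because the index is keyed by userid, a term lookup reads a single row and a single super column within it.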
  
- = Data Addressing =
+ == Example: multiuser blog ==
  
- The Thrift API introduces the notion of column paths and column parents.  
These normalize to both super and normal column families.  Conceptually a 
column parent always refers to a set of columns.  A column path always refers 
to a single column.
+ TODO
  
+ = Thrift API =
- The thrift definitions for these structures are:
- {{{
- struct ColumnParent {
-     3: string          column_family,
-     4: optional binary super_column,
- }
  
+ Moved to ["API"].
- struct ColumnPath {
-     3: string          column_family,
-     4: optional binary super_column,
-     5: optional binary column,
- }
- }}}
- 
- {{{#!wiki comments
- Edited up to here on 08/24. Will work on the rest soon. -asenchi
- }}}
- 
- Suppose we define a table called !MyTable with column families 
!MySuperColumnFamily (this a column family of type Super) and !MyColumnFamily 
(this is a simple column family). Any super column, SC in the 
!MySuperColumnFamily is addressed with the  "!MySuperColumnFamily:SC" and any 
column "C" within "SC" is addressed as 
- 
- new ColumnPath("!MySuperColumnFamily","SC","C")
- 
- Any column C within !MySimpleColumnFamily is addressed as 
- 
- new ColumnPath("!MySimpleColumnFamily",null,"C")
- 
- = Slice queries =
- == Slice Predicates ==
- == ColumnOrSuperColumn ==
- 
- = Range queries =
- 
- Cassandra supports pluggable partitioning schemes with a relatively small 
amount of code.  Out of the box, Cassandra provides the hash-based 
RandomPartitioner and an OrderPreservingPartitioner.  RandomPartitioner gives 
you pretty good load balancing with no further work required.  
OrderPreservingPartitioner on the other hand lets you perform range queries on 
the keys you have stored, but requires choosing node tokens carefully or active 
load balancing.  Systems that only support hash-based partitioning cannot 
perform range queries efficiently.
- 
- = Consistency Level =
- 
- = Batch Mutation =
  
  = Attribution =
  Thanks to phatduckk and asenchi for coming up with examples, text, and 
reviewing concepts.
