[Cassandra Wiki] Update of "DataModelv2" by EricEvans

Apache Wiki Tue, 13 Apr 2010 16:59:43 -0700

The "DataModelv2" page has been changed by EricEvans.
The comment on this change is: some inline feedback.
http://wiki.apache.org/cassandra/DataModelv2?action=diff&rev1=16&rev2=17

--------------------------------------------------

  
  We'll start from the bottom up, moving from the leaves of Cassandra's data 
structure (columns) up to the root of the tree (the cluster).
  
+ {{{
+ From my experience, comparing concepts to those in a relational database 
pretty consistently confuses people. This section intermixes the "structure of 
lists and maps" approach with relational db comparisons (e.g. "ColumnFamilies 
can be compared to a table in a relational database."), which is probably worse 
still.
+ 
+ I also think it's best to avoid referring to column families as containers.
+ }}}
+ 
  == Columns ==
  
  A Column is also known as a Tuple (triplet); it contains a name, a value, and 
a timestamp.
+ 
+ {{{
+ This wording suggests that Tuple is a synonym for Column (which is not true).
+ }}}
  
  All values are supplied by the client, including the 'timestamp'. This means 
that clocks on the clients should be synchronized (synchronizing clocks in the 
Cassandra server environment is useful as well), as these timestamps are used 
for conflict resolution. In many cases the 'timestamp' is not used in client 
applications, and it becomes convenient to think of a column as a name/value 
pair. For the remainder of this document, 'timestamps' will be elided for 
readability. It is also worth noting that the name and value are binary 
values, although in many applications they are UTF-8 serialized strings.
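
As an illustration of the paragraph above, here is a minimal Python sketch of a column as a name/value/timestamp triplet, with conflicts resolved in favour of the higher client-supplied timestamp. The names `Column` and `reconcile` are hypothetical and are not part of any Cassandra API; the timestamps are arbitrary illustrative integers, and ties between equal timestamps are not modeled.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical model of a column: binary name, binary value, client timestamp."""
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client, not by the server


def reconcile(a: Column, b: Column) -> Column:
    """Pick the version with the higher timestamp (ties keep `a`; an assumption)."""
    if a.name != b.name:
        raise ValueError("can only reconcile two versions of the same column")
    return b if b.timestamp > a.timestamp else a


# Two writes to the same column from clients whose clocks differ.
old = Column(b"email", b"old@example.com", 1000)
new = Column(b"email", b"new@example.com", 2000)
winner = reconcile(old, new)
```

Because the comparison only looks at the timestamps the clients sent, a client with a lagging clock can lose to an earlier write, which is why the clocks need to be synchronized.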
  
@@ -60, +70 @@

  </Keyspaces>
  
  In Cassandra, each column family is stored in a separate file, and the file 
is sorted in row (i.e. key) major order. Related columns, those that you'll 
access together, should be kept within the same column family.
+ 
+ {{{
+ IMO, you should avoid implementation details unless they are really relevant, 
as they distract (e.g. "each column family is stored in a separate file").
+ }}}
  
  The row key determines which machine the data is stored on. A key can be 
used for several column families at the same time; this does not, however, 
imply that the data from these column families is related. The semantics of 
having data for the same key in two different column families is entirely up 
to the client. Also, the columns can differ between the two column families. 
In fact, there may be a virtually unlimited set of column names defined, which 
leads to fairly common use of the column name as a piece of runtime-populated 
data. This is unusual in storage systems, particularly if you're coming from 
the relational database world. For each key you can have data from multiple 
column families associated with it. However, these are logically distinct, 
which is why the Thrift interface is oriented around accessing one 
!ColumnFamily per key at a time. On the other hand, a number of methods within 
the Thrift interface make use of this functionality: for example, 
batch_insert and batch_mutate make it possible to insert or modify data in 
multiple !ColumnFamilies at the same time, as long as the key for the 
different column families is the same. 
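
The following is a rough in-memory Python sketch of the idea in the paragraph above: one row key used in two column families, with column names that carry runtime data, written in a single batch. It is only a model built from nested dictionaries; `apply_batch`, `Users`, and `Logins` are made-up names, and the function does not reproduce the signature of the Thrift batch_insert or batch_mutate calls.

```python
from collections import defaultdict

# Model: column family name -> row key -> column name -> value.
keyspace = defaultdict(lambda: defaultdict(dict))


def apply_batch(keyspace, key, mutations):
    """Apply writes for ONE row key across several column families.

    `mutations` maps a column family name to a {column name: value} dict.
    This is a hypothetical helper, not the Thrift API.
    """
    for column_family, columns in mutations.items():
        keyspace[column_family][key].update(columns)


apply_batch(keyspace, b"jsmith", {
    # Fixed, schema-like column names.
    "Users": {b"email": b"jsmith@example.com"},
    # Column names populated at runtime (here, a login time used as the name).
    "Logins": {b"2010-04-13T16:59:43": b"ok"},
})
```

Nothing in the model ties the `Users` row to the `Logins` row other than the shared key; whether they are related is a decision the client makes.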
  
@@ -116, +130 @@

  
  The !SuperColumnFamily isn't much different from a normal !ColumnFamily 
except that it contains a list of super columns per row instead of
  a list of columns. The following example defines a super column family in your 
storage-conf.xml:
+ 
+ {{{
+ IMO, the term "SuperColumnFamily" should die.
+ }}}
  
  An example configuration of an Authors !ColumnFamily using the UTF-8 sorting 
implementation would be:
