[Cassandra Wiki] Update of "DataModel" by StuHood

Apache Wiki Tue, 09 Jun 2009 09:44:46 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The following page has been changed by StuHood:
http://wiki.apache.org/cassandra/DataModel

------------------------------------------------------------------------------
  = Introduction =
  
- Basic unit of access control within Cassandra is a Column Family. A table in 
Cassandra is made up of one or many column families. A row in a table is 
uniquely identified using a unique key. The is key is a string and can be of 
any size. The number of column families and the name of each column family must 
currently be fixed at the time the cluster is started. There is no limitation 
on the number of column families but it is expected that there would be 
relatively few of these. A column family can be of one of two type: Simple or 
Super. Columns within both of these are dynamically created and there is no 
limit on the number of these. Columns are constructs that are uniquely 
identified by a name, a value and a user-defined time stamp. The number of 
columns that can be contained in a column family could be very large. This can 
also vary per key. For instance key K1 could have 1024 columns/supercolumns 
while key K2 could have 64 columns/supercolumns. Supercolumns are constructs 
that have a name and an infinite number of columns associated with them. The 
number of supercolumns associated with any column family may be very large. 
They exhibit the same characteristics as columns. The columns can be sorted by 
name or time and this can be explicitly expressed via the configuration file, 
for any given column family.
+ Basic unit of access control within Cassandra is a Column Family. A table in 
Cassandra is made up of one or many column families. A row in a table is 
uniquely identified using a unique key. The key is a string and can be of any 
size. The number of column families and the name of each column family must 
currently be fixed at the time the cluster is started. There is no limitation 
on the number of column families but it is expected that there would be 
relatively few of these. A column family can be of one of two type: Simple or 
Super. Columns within both of these are dynamically created and there is no 
limit on the number of these. Columns are constructs that are uniquely 
identified by a name, a value and a user-defined time stamp. The number of 
columns that can be contained in a column family could be very large. This can 
also vary per key. For instance key K1 could have 1024 columns/supercolumns 
while key K2 could have 64 columns/supercolumns. SuperColumns are constructs 
that have a name and an infinite number of columns associated with them. The 
number of supercolumns associated with any column family may be very large. 
They exhibit the same characteristics as columns. The columns can be sorted by 
name or time and this can be explicitly expressed via the configuration file, 
for any given column family.
  
- The main limitation on column and supercolumn size is that all data for a 
single key must fit (on disk) on a single machine in the cluster.  Because keys 
alone are used to determine the nodes responsible for replicating their data, 
the amount of data associated with a single key has this upper bound.  This is 
an inherent limitation of the distribution model.
+ The main limitation on column and supercolumn size is that all data for a 
single key and column must fit (on disk) on a single machine in the cluster.  
Because keys alone are used to determine the nodes responsible for replicating 
their data, the amount of data associated with a single key has this upper 
bound.  This is an inherent limitation of the distribution model.
  
  Currently Cassandra also has the limitation that in the worst case, data for 
a key / ColumnFamily pair will all be deserialized into memory for a read 
request.  (But never for writes!)  This will be fixed in a future release.
  
  = More Detail =
  
- A row-oriented database stores rows in a row major fashion (i.e. all the 
columns in the row are kept together). A column-oriented database on the other 
hand stores data on a per-column basis. Column Families allow a hybrid 
approach. It allows you to break your row (the data corresponding to a key) 
into a static number of groups a.k.a Column Families. In Cassandra, the data in 
a table is stored in a separate file on a per-Column Family basis. And within 
each column family, the data is stored in row (i.e. key) major order. Related 
columns, those that you'll access together, should ideally be kept within the 
same column family for access efficiency. Furthermore columns in a column 
family can be sorted and stored on disk either in time sorted order or in name 
sorted order. However, individual SuperColumns are always sorted by name.  
Columns within a super column may be sorted by time. Suppose we define a table 
called !MyTable with column families !MySuperColumnFamily (this a 
column family of type Super) and !MyColumnFamily (this is simple column family). 
Any super column, SC in the !MySuperColumnFamily is addressed as 
"!MySuperColumnFamily:SC" and any column "C" within "SC" is addressed as 
!MySuperColumnFamily:SC:C. Any column C within !MySimpleColumnFamily is 
addressed as "!MySimpleColumnFamily:C". In short ":" is reserved word and 
should not be used as part of a Column Family name or as part of the name for a 
Super Column or Column.  (We plan to address this limitation for the 0.4 
release.)
+ A row-oriented database stores rows in a row-major fashion (i.e. all the 
columns in the row are kept together). A column-oriented database on the other 
hand stores data on a per-column basis. Column Families allow a hybrid 
approach. They allow you to break your row (the data corresponding to a key) 
into a static number of groups a.k.a Column Families. In Cassandra, each Column 
Family in a table is stored in a separate file, and the file is sorted in row 
(i.e. key) major order. Related columns, those that you'll access together, 
should ideally be kept within the same column family for access efficiency. 
Furthermore, columns in a column family can be sorted and stored on disk either 
in time sorted order or in name sorted order. SuperColumns, on the other hand, 
are always sorted by name. Columns within a super column may be sorted by time.
+ 
+ Suppose we define a table called !MyTable with column families 
!MySuperColumnFamily (this a column family of type Super) and !MyColumnFamily 
(this is simple column family). Any super column, SC in the 
!MySuperColumnFamily is addressed as "!MySuperColumnFamily:SC" and any column 
"C" within "SC" is addressed as !MySuperColumnFamily:SC:C. Any column C within 
!MySimpleColumnFamily is addressed as "!MySimpleColumnFamily:C". In short ":" 
is reserved word and should not be used as part of a Column Family name or as 
part of the name for a Super Column or Column.  (We plan to address this 
limitation for the 0.4 release.)
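
The colon-separated addressing described above can be illustrated with a small sketch. This is not Cassandra code: `ColumnPath`, `build_path` and `parse_path` are hypothetical names invented for this illustration, and the only behaviour assumed is what the text states, namely that ":" separates the Column Family, Super Column and Column names and therefore may not appear inside any of them.

```python
from typing import NamedTuple, Optional

SEPARATOR = ":"  # reserved by the addressing scheme described in the text


class ColumnPath(NamedTuple):
    """Hypothetical container for the parts of an address."""
    column_family: str
    super_column: Optional[str]
    column: Optional[str]


def build_path(column_family: str, *names: str) -> str:
    """Join a column family and up to two names into one address string."""
    parts = (column_family,) + names
    if len(parts) > 3:
        raise ValueError("an address has at most three parts")
    for part in parts:
        # A name containing ":" would make the address ambiguous.
        if not part or SEPARATOR in part:
            raise ValueError(f"invalid name: {part!r}")
    return SEPARATOR.join(parts)


def parse_path(path: str, is_super: bool) -> ColumnPath:
    """Split an address; is_super says whether the family is of type Super."""
    parts = path.split(SEPARATOR)
    max_parts = 3 if is_super else 2
    if len(parts) > max_parts or any(not p for p in parts):
        raise ValueError(f"malformed address: {path!r}")
    if is_super:
        # Pad so that a missing super column / column becomes None.
        family, super_column, column = parts + [None] * (3 - len(parts))
        return ColumnPath(family, super_column, column)
    family, column = parts + [None] * (2 - len(parts))
    return ColumnPath(family, None, column)
```

For example, `build_path("MySuperColumnFamily", "SC", "C")` yields the string "MySuperColumnFamily:SC:C", and parsing that string with `is_super=True` recovers the three names. A name such as "a:b" is rejected, which is the restriction the paragraph above describes.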
  
  = Range queries =
