Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The following page has been changed by MarkMcBride:
http://wiki.apache.org/cassandra/DataModel2

------------------------------------------------------------------------------
  The column is the lowest/smallest increment of data. It's a tuple (triplet) that contains a name, a value and a timestamp. Here's the Thrift interface definition of a Column:
- 
+ {{{
  struct Column {
    1: binary name,
    2: binary value,
    3: i64 timestamp,
  }
- 
+ }}}
  And here's a column represented in JSON-ish notation:
- 
+ {{{
  {   // this is a column
    name: "emailAddress",
    value: "[email protected]",
    timestamp: 123456789
  }
+ }}}
  All values are supplied by the client, including the timestamp. This means that clocks in the Cassandra environment should be synchronized, as these timestamps are used for conflict resolution. In many cases the timestamp is not used in client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, timestamps will be elided for readability. It is also worth noting that the name and value are binary values, although in many applications they are UTF-8 serialized strings.

  = Column Families =
  A column family is a container for columns. You define column families in your storage-conf.xml file, and cannot modify them (or add new column families) without restarting your Cassandra process. A column family holds an ordered list of columns, which you can reference by the column name. A JSON representation would be
- 
+ {{{
  { Users : {
      emailAddress : { // this is a column
        name: "emailAddress",
@@ -41, +42 @@
        value: "http://bar.com"
      }
  }}
- 
+ }}}
  Where "Users" is the column family, and "emailAddress" and "webSite" are columns.

  = Rows =
@@ -49, +50 @@
  A row-oriented database stores rows in a row-major fashion (i.e. all the columns in the row are kept together). A column-oriented database, on the other hand, stores data on a per-column basis. Column Families allow a hybrid approach.
  They allow you to break your row (the data corresponding to a key) into a static number of groups, a.k.a. Column Families. In Cassandra, each Column Family is stored in a separate file, and the file is sorted in row (i.e. key) major order. Related columns, those that you'll access together, should ideally be kept within the same column family for access efficiency. Column families have a configurable ordering applied to rows, which affects the behavior of the get_key_range call in the Thrift API. Out-of-the-box ordering implementations include ASCII, UTF-8, Long, and UUID (lexical or time). A JSON representation of the row -> column family -> column structure is
- 
+ {{{
  {
    mccv : {Users : { emailAddress : {name: "emailAddress", value: "[email protected]"}
                      webSite : { name: "webSite", value: "http://bar.com"}}
@@ -62, +63 @@
                      twitter : { name: "twitter", value: "user2"}}
    }
  }
- 
+ }}}
  Note that the row mccv identifies data in two different column families (Users and Stats). This does not imply that data from these column families *must* be related. The semantics of having data for the same key in two different column families is entirely up to the application. Also note that within the Users column family, mccv and user2 have different column names defined. This is perfectly valid in Cassandra. In fact there may be a virtually unlimited set of column names defined, which leads to fairly common use of the column name as a piece of runtime-populated data. This is unusual in storage systems, particularly if you're coming from the RDBMS world.

  = Keyspaces =
@@ -74, +75 @@
  So far we've covered "normal" column families. Cassandra also supports super columns and super column families. A super column family is a column family whose members are super columns. A super column is just an associative array of columns. Another way to think about this...
  a super column is structurally very similar to a column family, and a super column family is a column family that contains column families. A JSON description of this layout follows
- 
+ {{{
  {
    mccv : {
      Tags : {
        cassandra : {
@@ -86, +87 @@
        }
      }
    }
- 
+ }}}
  Here my super column family is "Tags". I have two super columns defined here, "cassandra" and "thrift". Within these I have specific named bookmarks, each of which is a column.

  == Example: SuperColumns for Search Apps ==
@@ -97, +98 @@
  = Data Addressing =
  The Thrift API introduces the notion of column paths and column parents. These apply uniformly to both super and normal column families. Conceptually a column parent always refers to a set of columns. A column path always refers to a single column. Thrift definitions for these structures are
- 
+ {{{
  struct ColumnParent {
    3: string column_family,
    4: optional binary super_column,
@@ -108, +109 @@
    4: optional binary super_column,
    5: optional binary column,
  }
- 
+ }}}
  Suppose we define a table called !MyTable with column families !MySuperColumnFamily (this is a column family of type Super) and !MyColumnFamily (this is a simple column family).

  Any super column "SC" in !MySuperColumnFamily is addressed as "!MySuperColumnFamily:SC", and any column "C" within "SC" is addressed as new ColumnPath("!MySuperColumnFamily","SC","C")
@@ -128, +129 @@
  = Batch Mutation =

+ = Attribution =
+ Thanks to phatduckk and asenchi for coming up with examples, text, and reviewing concepts.
+ 
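For readers who want to poke at the row -> column family -> column nesting described in the changed page, here is a minimal Python sketch. It is not Cassandra client code: the `Store` layout and the `insert` and `get` helpers are hypothetical names invented for illustration. The page only says that timestamps are used for conflict resolution, so the sketch assumes the column with the higher timestamp wins and that a tie keeps the existing column.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Column:
    """Mirrors the Thrift Column struct: name, value, client-supplied timestamp."""
    name: bytes
    value: bytes
    timestamp: int


# Hypothetical in-memory layout: row key -> column family -> column name -> Column.
Store = Dict[str, Dict[str, Dict[bytes, Column]]]


def insert(store: Store, key: str, column_family: str, column: Column) -> None:
    """Write a column, keeping whichever version has the higher timestamp.

    Assumption: a tie keeps the existing column. The page does not say how
    equal timestamps are resolved.
    """
    columns = store.setdefault(key, {}).setdefault(column_family, {})
    current = columns.get(column.name)
    if current is None or column.timestamp > current.timestamp:
        columns[column.name] = column


def get(store: Store, key: str, column_family: str, name: bytes) -> Optional[Column]:
    """Look up a single column by row key, column family, and column name."""
    return store.get(key, {}).get(column_family, {}).get(name)


store: Store = {}
insert(store, "mccv", "Users", Column(b"webSite", b"http://bar.com", 200))
# A write carrying an older timestamp arrives late and is ignored.
insert(store, "mccv", "Users", Column(b"webSite", b"http://old.example", 100))
# Rows in the same column family may define different column names.
insert(store, "user2", "Users", Column(b"twitter", b"user2", 150))
```

Looking up `get(store, "mccv", "Users", b"webSite")` returns the column written with timestamp 200, while `get(store, "mccv", "Users", b"twitter")` returns `None` because only the user2 row defines that column name.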
