Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "DataModelv2" page has been changed by ronaldmathies.
http://wiki.apache.org/cassandra/DataModelv2?action=diff&rev1=17&rev2=18
--------------------------------------------------
= Introduction =
- Cassandra has a data model that far different from normal relational databases, instead of having schemas, tables and column the data
+ Cassandra has a data model that is far different from normal relational databases, instead of having schemas, tables and column the data
model consists of a structure of lists and maps. When we start to look from the highest level we have clusters, clusters are physical machines operating together and forming a logical
+ Cassandra instance. A cluster can contain several keyspaces. A keyspace is a group consisting of various ColumnFamilies, in general an application uses a single Keyspace. !ColumnFamilies consists of rows which in turn consists of multiple values (Columns) per row.
- Cassandra instance. A cluster can contain several keyspaces. A keyspace is very similar to a relation database schema and contains
- a number of !ColumnFamilies. !ColumnFamilies can be compared to a table in a relation database. And a !ColumnFamily contains Columns.
- A !ColumnFamily comes in two flavors, the first one we already described, which is a !ColumnFamily which has columns. The second one
+ A !ColumnFamily comes in two flavors, the first one we already described, which is a !ColumnFamily which has columns. The second !ColumnFamily
- is also called a !SuperColumnFamily, this one contains !SuperColumns where the !SuperColumns contain a list of Columns. If it becomes
- confusing just read on, it will become clearer.
+ contains !SuperColumns where the !SuperColumns contain a list of Columns. If it becomes confusing just read on, it will become clearer.
+
+ So to recap, a !ColumnFamily can contain either a list of Columns or a list of !SuperColumns.
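The "structure of lists and maps" described in the introduction can be pictured with plain nested dictionaries. This is only a rough sketch, not how Cassandra stores or exposes data: the row key "ronald", the subcolumn "title" and the helper `nesting_depth` are hypothetical, and timestamps are left out.

```python
# Illustrative only: plain dictionaries standing in for the data model.
# Row key "ronald" and subcolumn "title" are hypothetical; timestamps omitted.
keyspace = {
    "Authors": {                       # ColumnFamily whose rows hold Columns
        "ronald": {                    # row key
            "lastname": "Steward",     # column name -> value
            "birthday": "01/01/1982",
        },
    },
    "Posts": {                         # ColumnFamily whose rows hold SuperColumns
        "my-new-guitar": {             # row key
            "post": {                  # SuperColumn key
                "title": "My new guitar",  # column name -> value
            },
        },
    },
}


def nesting_depth(node):
    """Count how many dictionary levels sit below (and including) node."""
    if not isinstance(node, dict) or not node:
        return 0
    return 1 + max(nesting_depth(child) for child in node.values())


# A ColumnFamily of Columns is two levels deep (row key, column name);
# a ColumnFamily of SuperColumns adds one more level (SuperColumn key).
print(nesting_depth(keyspace["Authors"]))
print(nesting_depth(keyspace["Posts"]))
```

The extra level in "Posts" is the only structural difference between the two flavors in this sketch.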
We'll start from the bottom up, moving from the leaves of Cassandra's data structure (columns) up to the root of the tree (the cluster).
{{{
+ -- REMARK: ---
From my experience, comparing concepts to those in a relational database pretty consistently confuses people. This section intermixes the "structure of lists and maps" approach with relational db comparisons "ColumnFamilies can be compared to a table in a relational database.", which is probably worse still.
- I also think it's best to avoid referring to column families as containers.
+ I also think it's best to avoid referring to column families as containers.''
+
+ -- SOLUTION: ---
+ Changed the above description, there is no reference to a relation database anymore
}}}
== Columns ==
- A Column is also known as a Tuple (triplet), it contains a name, value and a timestamp.
+ A Column consists of a name, value and a timestamp.
{{{
+ --- REMARK: ---
This wording suggests that Tuple is a synonym for Column (which is not true).
+
+ --- SOLUTION: ---
+ Removed the synonym
}}}
All values are supplied by the client, including the 'timestamp'. This means that clocks on the clients should be synchronized (in the Cassandra server environment is useful also), as these timestamps are used for conflict resolution. In many cases the 'timestamp' is not used in client applications, and it becomes convenient to think of a column as a name/value pair. For the remainder of this document, 'timestamps' will be elided for readability. It is also worth noting the name and value are binary values, although in many applications they are UTF8 serialized strings.
@@ -69, +77 @@
</Keyspace>
</Keyspaces>
- In Cassandra, each column family is stored in a separate file, and the file is sorted in row (i.e. key) major order. Related columns, those that you'll access together, should be kept within the same column family.
+ In Cassandra, each column family is sorted in row (i.e. key) major order.
Related columns, those that you'll access together, should be kept within the same column family.
{{{
+ --- REMARK: ---
IMO, you should avoid implementation details unless they are really relevant, as it distracts, (i.e. "each column family is stored in a separate file").
+
+ --- SOLUTION: ---
+ Nice remark, this should indeed be covered in a separate topic about the storage itself.
}}}
The row key is what determines what machine data is stored on. A key can be used for several column families at the same time, this does however not imply that the data from these column families is related. The semantics of having data for the same key in two different column families is entirely up to the client. Also, the columns can be different between the two column families. In fact there may be a virtually unlimited set of column names defined, which leads to fairly common use of the column name as a piece of runtime populated data. This is unusual in storage systems, particularly if you're coming from the relational database world. For each key you can have data from multiple column families associated with it. However, these are logically distinct, which is why the Thrift interface is oriented around accessing one !ColumnFamily per key at a time. On the other hand, a number of methods within the Thrift interface make use of this functionality, for example the batch_insert and batch_mutate make it possible to insert or modify data in multiple !ColumnFamilies at the same time, as long as the key for the different column families are the same.
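The idea that one row key can carry logically distinct data in several column families can be sketched with an in-memory stand-in. This is not the Thrift interface and does not reproduce batch_insert or batch_mutate; `write_row`, the "Preferences" column family, the key "ronald" and the column "theme" are hypothetical names used only for illustration.

```python
from collections import defaultdict

# In-memory stand-in: column family name -> row key -> column name -> value.
# This is NOT the Thrift API; it only mimics the shape of the data.
store = defaultdict(lambda: defaultdict(dict))


def write_row(store, key, columns_by_family):
    """Write columns for a single row key into several column families."""
    for family, columns in columns_by_family.items():
        store[family][key].update(columns)


# One key, two column families, different column names in each.
write_row(store, "ronald", {
    "Authors": {"lastname": "Steward"},
    "Preferences": {"theme": "dark"},  # hypothetical column family
})

# Reads stay oriented around one column family per key at a time.
print(store["Authors"]["ronald"])
print(store["Preferences"]["ronald"])
```

The two rows share a key but nothing else; whether they are related is a decision the client makes.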
@@ -122, +134 @@
|| || "lastname" || "Steward" || 1270084021 ||
|| || "birthday" || "01/01/1982" || 1270084021 ||
- As you can see it looks the same as a !ColumnFamily, the only difference is the usage, a !SuperColumn is used within a !SuperColumnFamily, so
+ As you can see it looks the same as a !ColumnFamily, the only difference is the usage, a !SuperColumn is used within a !ColumnFamily, so
it adds an extra layer in your data structure, instead of having only a row which consists of a key and a list of columns we can now have a row which consists of a key and a list of super columns which by itself has keys and per key a list of columns.
- == SuperColumnFamily ==
+ == !ColumnFamily containing !SuperColumns ==
- The !SuperColumnFamily isn't much different from a normal !ColumnFamily except that it contains a list of super columns per row instead of
+ A !ColumnFamily which contains !SuperColumns isn't that much different from a !ColumnFamily containing Columns, instead of having a row consisting of Columns we have rows consisting of !SuperColumns.
+
- a list of columns. To following example defines a super column family in your storage-conf.xml:
+ The following example defines a super column family in your storage-conf.xml:
{{{
+ -- REMARK: ---
IMO, the term "SuperColumnFamily" should die.
+
+ -- SOLUTION: ---
+ And it's dead, i've removed it everywhere and rephrased the sentences to make it clear.
}}}
An example configuration of an Authors !ColumnFamily using the UTF-8 sorting implementation would be:
@@ -143, +160 @@
</Keyspace>
</Keyspaces>
- The !ColumnType tells cassandra that the Posts columns family is a super column family, the !CompareSubcolumnsWith attribute defines the sorting behavior of the keys of the super columns.
+ The !ColumnType tells Cassandra that the Posts columns family is a !ColumnFamily containing !SuperColumns, the !CompareSubcolumnsWith attribute defines the sorting behavior of the keys of the super columns.
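To see why the choice of sorting behavior matters, here is a small sketch that treats a comparator as nothing more than an ordering over binary names. It makes no claim about how Cassandra implements its comparators; the names b"1", b"9", b"10" and both helper functions are hypothetical.

```python
# Hypothetical names, kept as bytes because names are binary values.
subcolumns = {b"9": "ninth", b"10": "tenth", b"1": "first"}


def names_sorted_as_utf8(columns):
    """Order names as UTF-8 text, character by character."""
    return sorted(columns, key=lambda name: name.decode("utf-8"))


def names_sorted_as_integers(columns):
    """Order the same names by their numeric value instead."""
    return sorted(columns, key=lambda name: int(name.decode("utf-8")))


# The same three names come back in a different order per comparison rule.
print(names_sorted_as_utf8(subcolumns))
print(names_sorted_as_integers(subcolumns))
```

Text ordering places "10" before "9", while numeric ordering does not, so the sorting attribute should match how the names will be read back.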
Model representation:
- ||<-2> '''!SuperColumnFamily''' ||
+ ||<-2> '''!ColumnFamily''' ||
|| '''key''' || '''list''' ||
|| binary || 1 .. * !SuperColumns ||
Data representation:
- ||<-5> '''!SuperColumnFamily''' ||
+ ||<-5> '''!ColumnFamily''' ||
|| '''Key''' ||<-4> '''!SuperColumns''' ||
|| "my-new-guitar" || '''key''' ||<-3> '''Columns''' ||
|| || post || '''name''' || '''value''' || '''timestamp''' ||
@@ -177, +194 @@
== Keyspaces ==
- A keyspace is the first dimension of the Cassandra hash, and is the container for column families. Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world. They are the configuration and management point for column families, and is also the structure on which batch inserts are applied. In most cases you will have one keyspace for an application.
+ A keyspace is the first dimension of the Cassandra hash, and is the container for the !ColumnFamilies. Keyspaces are of roughly the same granularity as a schema or database (i.e. a logical collection of tables) in the RDBMS world. They are the configuration and management point for column families, and is also the structure on which batch inserts are applied. In most cases you will have one Keyspace for an application.
== Modeling your application ==
