Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "FileFormatDesignDoc" page has been changed by StuHood. http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=8&rev2=9 -------------------------------------------------- Because we will be storing multiple columns per SSTable, our design will bear the most similarity to RCFile (rather than the column-per-file approach taken in Dremel). But because we allow for nesting via super columns (and hopefully with a more flexible representation in the future), we need to take hints from Dremel's serialization to allow for efficient storage of parent and null information. + === Vertical chunks === + - Rather than dividing the table representation into chunks using horizontal slices, we will use vertical chunks (and horizontal chunks when necessary for particularly wide rows): + Rather than slicing the span into chunks horizontally, we will use vertical chunks (and horizontal chunks only when necessary for particularly wide rows): || ''row key'' || || cheese || @@ -83, +85 @@ || 4.9 || || china || - This representation achieves the benefits for compression shown in the RCFile paper: similar values are always stored together. But we have lost some information!: Using the tables above, it is impossible to determine which fields at level "name1" are cheeses, and which are fruits. We need to store parent information, and one method comes from Dremel's clever representation for arbitrary nesting. We add a single boolean to each tuple that toggles to represent parent changes: + This representation achieves some of the benefits for compression shown in the RCFile paper: similar values are stored much closer together. But we have lost some information!: Using the tables above, it is impossible to determine which fields at level "name1" are cheeses, and which are fruits. + + === Parent representation === + + We need to store parent information, and one method comes from Dremel's clever representation for arbitrary nesting. We add a single boolean to each tuple that toggles to represent parent changes: || ''row key'' || ''parent_change'' || || cheese || 0 || @@ -116, +122 @@ The parent change flag can be represented compactly using a bitmap, and field lengths can be packed tightly into group-varint encoded arrays [3], as alluded to in the Dremel paper, and mentioned in Jeff Dean's talks. + === Metadata === + - Cassandra also needs to encode metadata about tuples and ranges of tuples, in order to represent creation and deletion timestamps: range tuples can be encoded in a similar fashion to the value tuples represented here, and the metadata itself can also be group-varint encoded. + Cassandra also needs to encode metadata about tuples and ranges of tuples, in order to represent creation and deletion timestamps: range tuples can be encoded in a similar fashion to the value tuples represented here, and the metadata timestamps can be group-varint encoded. + + === Field reordering === + + One weakness of the implementation so far is that it doesn't allow tuples to be reordered within a level. This approach performs well for wide rows with high field cardinality, since adding compression is unlikely to remove data. + + Since we have domain knowledge that a compression algorithm would not, it will often be more efficient to perform reordering by ourselves, particularly when a chunk has low cardinality: for example at the "name2" level above. By assigning the chunk an ''ordered'' type, we can store the fields in sorted order (rather than in parent-sorted order) and remove duplicates. + + || ''name2'' || + || flavor || + || origin || + + More importantly, a chunk of type ''ordered'' should influence the order of tuples in child chunks. When we encounter an ''ordered'' chunk at level "name2", we should expect its children in level "value" to be arranged as follows: + + || ''value'' || ''parent_change'' || + || 3.4 || 1 || + || 5.6 || 1 || + || 2.6 || 1 || + || 4.2 || 1 || + || 4.9 || 1 || + || || 0 || + || france || 1 || + || || 0 || + || || 0 || + || china || 1 || + + The ''parent_change'' field is now a bitmap representing nulls: it indicates that all parents have a 'flavor' tuple, but only the second and fifth parents have an 'origin' tuple. This representation is ripe for compression. == Schema ==
