[Cassandra Wiki] Update of "FileFormatDesignDoc" by Stu Hood

Apache Wiki Wed, 05 Jan 2011 01:23:58 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "FileFormatDesignDoc" page has been changed by StuHood.
The comment on this change is: Clarified the metadata discussion, added 
horizontal rules.
http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=15&rev2=16

--------------------------------------------------

   * Space efficient when un-compressed: remove redundancy
   * Random access to the middle of wide rows
   * Arbitrary nesting
+  * Range tombstones (for range/slice deletes)
  
  == Influences ==
  
   * Google Dremel [1] - Arbitrarily nested, field-oriented serialization
   * Hive RCFile [2] - Column-group-oriented storage
+ 
+ ----
  
  == Current Implementation ==
  
@@ -48, +51 @@

  
  Finally, there is a second type of redundancy that the current design does 
not tackle: the column names at level "name2" are frequently repeated, but 
since rows are stored independently, we don't normalize those values. For 
narrow rows (like those shown), removing this redundancy will be our largest 
win.
  
- == High level ==
+ ==== Metadata ====
+ 
+ Metadata is currently implemented such that column parents have metadata that 
covers their entire range: this means that you cannot delete arbitrary slices, 
only exact keys or names.
+ 
+ ----
+ 
+ == Proposed Implementation ==
  
  Because we will be storing multiple columns per SSTable, our design will bear 
the most similarity to RCFile (rather than the column-per-file approach taken 
in Dremel). But because we allow for nesting via super columns (and hopefully 
with a more flexible representation in the future), we need to take hints from 
Dremel's serialization to allow for efficient storage of parent and null 
information.
  
  === Vertical chunks ===
  
- Rather than slicing the span into chunks horizontally, we will use vertical 
chunks (and horizontal chunks only when necessary for particularly wide rows):
+ Rather than slicing the span into chunks horizontally, we will use vertical 
chunks (and break particularly wide rows into multiple spans):
  
  || ''row key'' ||
  || cheese  ||
@@ -122, +131 @@

  
  The parent change flag can be represented compactly using a bitmap, and field 
lengths can be packed tightly into group-varint encoded arrays [3], as alluded 
to in the Dremel paper, and mentioned in Jeff Dean's talks.
  
- === Metadata ===
- 
- Cassandra also needs to encode metadata about tuples and ranges of tuples, in 
order to represent creation and deletion timestamps: range tuples can be 
encoded in a similar fashion to the value tuples represented here, and the 
metadata timestamps can be group-varint encoded.
- 
  === Field reordering ===
  
  One weakness of the implementation so far is that it doesn't allow tuples to 
be reordered within a level. This approach performs well for wide rows with 
high field cardinality, since adding compression is unlikely to remove data.
@@ -154, +159 @@

  
  === Summary ===
  
- The final (simplified) representation of the span is:
+ A (simplified) representation of the span so far (without metadata) is:
  
  ''(parent-ordered)''
  || ''row key'' || ''parent_change'' ||
@@ -184, +189 @@

  || || 0 ||
  || china || 1 ||
  
+ == Metadata ==
+ 
+ Cassandra also needs to encode metadata about tuples and ranges of tuples in 
order to represent creation and deletion timestamps. For both value tuples and 
range tuples, a varying number (depending on value and range type) of 
timestamps will also need to be encoded.
+ 
+ === Range Metadata ===
+ 
+ Range tuples can be encoded in a very similar fashion to the value tuples 
represented above, except that they always come in pairs. It will likely make 
sense to store them in a separate blob from the value tuples, since they will 
bear very little similarity to one another (TODO: need to confirm with an 
anecdote or two).
+ 
+ || ''name1'' - ''left'' || ''name1'' - ''right''  || ''parent_change'' ||
+ || havarti || muenster || 0 ||
+ || || || 1 ||
+ 
+ This example shows a range tombstone for values at level "name1" between 
'havarti' and 'muenster': the chunk for the "name1" level stores a pair of 
range tuples for the 'cheese' parent and a nulls are stored for parents without 
any range metadata. The end result is that the span stores a tombstone from 
('cheese', 'havarti', <empty>) to ('cheese', 'muenster', null), where <empty> 
is the smallest value, and null is the largest value.
+ 
+ Note that it is not possible for ranges for a parent to overlap: in this 
case, the ranges would be resolved such that the intersection was given the 
winning timestamp, and the two remainders would use their original timestamps.
+ 
+ ==== Effect of ordering ====
+ 
+ When a chunk is marked as ''self'' ordered, range metadata should be affected 
as well: therefore, the number of ranges that need to be represented in a chunk 
should also factor into the cardinality threshold that toggles a chunk between 
''self'' and ''parent'' ordering.
+ 
+ === Value Metadata ===
+ 
+ Unlike range metadata, value (column) metadata will only be stored in the 
last chunk of a span (since the entire span is used to represent the path to 
the column, you don't really care about metadata until you've built the full 
path to the value).
+ 
+ ----
+ 
  == Schema ==
  
  To iterate quickly, the initial versions of the file format will be 
implemented using Avro. The current Avro schema:

[Cassandra Wiki] Update of "FileFormatDesignDoc" by Stu Hood

Reply via email to