[Cassandra Wiki] Trivial Update of "FileFormatDesignDoc" by StuHood

Apache Wiki Sat, 01 Jan 2011 17:37:22 -0800

The "FileFormatDesignDoc" page has been changed by StuHood.
http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=7&rev2=8

--------------------------------------------------

  || cheese  || gouda || flavor  || 5.6   ||
  || cheese  || gouda || origin  || france   ||
  || cheese  || swiss || flavor || 2.6 ||
+ || ''row key'' || ''name1''  || ''name2'' || ''value'' ||
  || fruit   || apple || flavor || 4.2 ||
  || fruit   || pear  || flavor || 4.9 ||
  || fruit   || pear  || origin || china ||
@@ -36, +37 @@

  ||         || gouda || flavor  || 5.6   ||
  ||         ||       || origin  || france   ||
  ||         || swiss || flavor || 2.6 ||
+ || ''row key'' || ''name1''  || ''name2'' || ''value'' ||
  || fruit   || apple || flavor || 4.2 ||
  ||         || pear  || flavor || 4.9 ||
  ||         ||       || origin || china ||
  
- The current implementation of SSTables lays data out on disk in approximately 
this way: data for rows is stored contiguously. In relation to the table 
representation above, we divide the tree into pieces using horizontal "chunks". 
One must seek to the "root" of the tree for a row in order to read the row 
index and determine which chunk the next level of the tree is stored in.
+ The current implementation of SSTables lays data out on disk in approximately 
this way: data for rows is stored contiguously. In relation to the table 
representation above, we divide the tree into pieces using horizontal "chunks". 
One must seek to the root of the tree for a row in order to read the row index 
and determine which chunk the next level of the tree is stored in.
  
- Additionally, only the first level of the tree is indexed: in order to find a 
particular column at the level labeled "name2", you would need to deserialize 
all columns at that level.
+ Additionally, only the first level of the tree is indexed: in order to find a 
particular column at the level labeled "name2", you would need to deserialize 
all columns at that level, which makes large super columns impractical.
  
  Finally, there is a second type of redundancy that the current design does 
not tackle: the column names at level "name2" are frequently repeated, but 
since rows are stored independently, we don't normalize those values. For 
narrow rows (like those shown), removing this redundancy will be our largest 
win.
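
The repetition at level "name2" can be counted directly from the rows visible in the hunk above. The following is only an illustrative sketch, not part of the design doc: the tuple layout and the variable names are assumptions made for this example.

```python
# Rows as (row key, name1, name2, value), copied from the table rows
# visible in the diff hunk above. The tuple layout is an assumption
# made for illustration only.
rows = [
    ("cheese", "gouda", "flavor", "5.6"),
    ("cheese", "gouda", "origin", "france"),
    ("cheese", "swiss", "flavor", "2.6"),
    ("fruit", "apple", "flavor", "4.2"),
    ("fruit", "pear", "flavor", "4.9"),
    ("fruit", "pear", "origin", "china"),
]

# Every row stores its own copy of the "name2" column name.
name2_column = [name2 for _, _, name2, _ in rows]

# The distinct names are what a normalized layout would need to keep.
distinct_name2 = sorted(set(name2_column))

stored = len(name2_column)
unique = len(distinct_name2)
print(f"{stored} names stored, {unique} distinct: {distinct_name2}")
```

For these six rows, six names are stored but only two are distinct, which is the redundancy the paragraph above describes for narrow rows.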
  
@@ -81, +83 @@

  || 4.9 ||
  || china ||
  
- This representation achieves the benefits for compression shown in the RCFile 
paper: similar values are always stored together. But we have lost some 
information!: Using the tables above, it is impossible to determine which 
fields at level "name1" are cheeses, and which are fruits. We need to store 
parent information, and our method should come from Dremel's clever 
representation for arbitrary nesting. We add a single boolean to each tuple 
that toggles to represent parent changes:
+ This representation achieves the benefits for compression shown in the RCFile 
paper: similar values are always stored together. But we have lost some 
information: using the tables above, it is impossible to determine which 
fields at level "name1" are cheeses, and which are fruits. We need to store 
parent information, and one method comes from Dremel's clever representation 
for arbitrary nesting. We add a single boolean to each tuple that toggles to 
represent parent changes:
  
  || ''row key'' || ''parent_change'' ||
  || cheese  || 0 ||
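
The toggling boolean described above can be sketched as follows. This is a hedged illustration rather than the page's actual encoding: the function names are hypothetical, the starting bit value of 0 is an assumption, and the sketch assumes every parent has at least one child (a parent with no children cannot be represented by a toggle alone).

```python
def encode_parent_changes(parents):
    """Return one bit per child; the bit flips whenever the parent
    differs from the previous child's parent (starting at 0)."""
    bits = []
    bit = 0
    for i, parent in enumerate(parents):
        if i > 0 and parent != parents[i - 1]:
            bit ^= 1  # parent changed: toggle
        bits.append(bit)
    return bits


def decode_parents(bits, parent_values):
    """Recover each child's parent by advancing to the next parent
    value every time the bit flips. Assumes every parent has at
    least one child."""
    parents = []
    index = 0
    for i, bit in enumerate(bits):
        if i > 0 and bit != bits[i - 1]:
            index += 1  # a flip means we moved to the next parent
        parents.append(parent_values[index])
    return parents


# Level "name1" from the example tables, with each entry's row key.
name1 = ["gouda", "swiss", "apple", "pear"]
name1_parents = ["cheese", "cheese", "fruit", "fruit"]

bits = encode_parent_changes(name1_parents)
recovered = decode_parents(bits, ["cheese", "fruit"])
```

With these inputs the bits are `[0, 0, 1, 1]`: gouda and swiss share a parent, and the flip at apple marks the move from cheese to fruit, so the parent of every "name1" entry can be recovered from the ordered list of row keys.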
