[Hadoop Wiki] Trivial Update of "Hbase/NewFileFormat" by JimKellerman

Apache Wiki Fri, 03 Oct 2008 16:29:02 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The following page has been changed by JimKellerman:
http://wiki.apache.org/hadoop/Hbase/NewFileFormat

The comment on the change is:
Escape some words that look like Wiki words but are not.

------------------------------------------------------------------------------
  
  == Current Implementation ==
  
- Currently -- circa 0.19.0 -- hbase Store files are built on 
''org.apache.hadoop.io.MapFile''. MapFile is made of two 
''org.apache.hadoop.io.SequenceFile''s: a sorted data file of key/values and 
an accompanying index file. Once written, these files do not change (both 
data and index file).  The current index is 'flat', made of keys and their 
offsets.  An index entry is made for every Nth entry of the data file where N 
is configurable with a default of 32 in hbase (it's 128 for hadoop).
+ Currently -- circa 0.19.0 -- hbase Store files are built on 
''org.apache.hadoop.io.!MapFile''. !MapFile is made of two 
''org.apache.hadoop.io.SequenceFile''s: a sorted data file of key/values and 
an accompanying index file. Once written, these files do not change (both 
data and index file).  The current index is 'flat', made of keys and their 
offsets.  An index entry is made for every Nth entry of the data file where N 
is configurable with a default of 32 in hbase (it's 128 for hadoop).
  
- MapFiles can be configured to compress each key/value entry or compress based 
off a block size.  Blocks do not span key/values but break on entries.
+ !MapFiles can be configured to compress each key/value entry or compress 
based off a block size.  Blocks do not span key/values but break on entries.
  
  Hbase keys are made of row/column/timestamp.  Rows and columns are 
effectively binary.  Timestamp is a long.  The sort is not a straightforward 
binary sort; its idiosyncrasies are embodied in the particular Comparator 
passed when creating the store file: e.g. timestamps sort in reverse order 
because we want to find the newest first.
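  
  The ordering described above can be sketched in a few lines.  This is only 
an illustration, not the actual hbase Comparator: the `StoreKey` type and the 
`sort_key` helper are hypothetical names, and the sketch models only the 
reversed timestamp, none of the Comparator's other idiosyncrasies.

```python
from typing import NamedTuple


class StoreKey(NamedTuple):
    """Hypothetical stand-in for an hbase store-file key."""
    row: bytes
    column: bytes
    timestamp: int  # a long in hbase


def sort_key(k: StoreKey):
    # Row and column compare as raw bytes, ascending.  Negating the
    # timestamp makes newer entries sort first within a row/column.
    return (k.row, k.column, -k.timestamp)


keys = [
    StoreKey(b"row1", b"info:a", 100),
    StoreKey(b"row1", b"info:a", 300),
    StoreKey(b"row0", b"info:b", 200),
    StoreKey(b"row1", b"info:a", 200),
]
ordered = sorted(keys, key=sort_key)
```

With this ordering, the first entry met for a given row/column is always the 
newest one.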
  
- Every hbase flush creates a new MapFile in the file system and an 
accompanying SequenceFile of metadata, an 'info' file.  Metadata includes the 
id of the last edit added to the MapFile and if the store file is a 
'reference' file -- more on this later (TODO) -- it also includes info on 
what's referred to.
+ Every hbase flush creates a new !MapFile in the file system and an 
accompanying SequenceFile of metadata, an 'info' file.  Metadata includes the 
id of the last edit added to the !MapFile and if the store file is a 
'reference' file -- more on this later (TODO) -- it also includes info on 
what's referred to.
  
  Optionally administrators can enable bloomfilters on hbase stores.  The 
bloomfilter allows a fast test of whether or not the store file contains an 
entry.  The bloomfilter is persisted into the filesystem in its own file.
  
@@ -20, +20 @@

  
  == Common Index-based Accesses ==
  
- To look up a particular key, a query is first made against the MapFile index 
to find the nearest key using a binary search.  We then go to the data file, 
seek to the index offset and iterate until we find the queried key or we've 
moved past where it should have been in the file.
+ To look up a particular key, a query is first made against the !MapFile index 
to find the nearest key using a binary search.  We then go to the data file, 
seek to the index offset and iterate until we find the queried key or we've 
moved past where it should have been in the file.
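  
  A minimal sketch of the lookup just described, assuming an in-memory list 
of sorted (key, value) pairs in place of the data file and list positions in 
place of byte offsets.  `build_index` and `lookup` are hypothetical helpers, 
not MapFile API; the interval of 32 is the hbase default mentioned earlier.

```python
import bisect

INTERVAL = 32  # hbase default index interval, per the text above


def build_index(entries, interval=INTERVAL):
    """Index every Nth entry: parallel lists of keys and their positions."""
    index_offsets = list(range(0, len(entries), interval))
    index_keys = [entries[pos][0] for pos in index_offsets]
    return index_keys, index_offsets


def lookup(entries, index_keys, index_offsets, key):
    # Binary search the index for the last indexed key <= the queried key.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # the key sorts before the first entry in the file
    pos = index_offsets[i]  # "seek" to the indexed position
    # Iterate until we find the key or move past where it should be.
    while pos < len(entries):
        k, v = entries[pos]
        if k == key:
            return v
        if k > key:
            return None
        pos += 1
    return None


# 500 sorted entries with even-numbered keys only.
entries = [(f"key{i:05d}".encode(), i) for i in range(0, 1000, 2)]
index_keys, index_offsets = build_index(entries)
```

Each lookup scans at most INTERVAL entries after the seek, which is the cost 
paid for keeping the index small.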
  
  Another common access pattern has us asking for the row that falls closest 
to that which we asked for, both closest-before and closest-after (if not an 
exact match).  To figure the closest row, we go to the index first and then 
iterate forward.
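  
  The closest-row access can be sketched the same way: binary search the 
index, then iterate forward while remembering the last row seen.  The 
`closest` helper and the tiny interval of 2 are assumptions made for the 
example, and list positions again stand in for file offsets.

```python
import bisect


def closest(rows, index_rows, index_pos, wanted):
    """Return (closest_before, closest_after); an exact match fills both."""
    # Start at the last indexed row <= wanted, or at the top of the file.
    i = bisect.bisect_right(index_rows, wanted) - 1
    pos = index_pos[i] if i >= 0 else 0
    before = None
    while pos < len(rows):
        row = rows[pos]
        if row == wanted:
            return row, row
        if row > wanted:
            return before, row  # first row past the wanted one
        before = row
        pos += 1
    return before, None  # ran off the end of the file


rows = [b"apple", b"cherry", b"grape", b"melon", b"peach", b"plum"]
INTERVAL = 2  # tiny interval so the example index has several entries
index_pos = list(range(0, len(rows), INTERVAL))
index_rows = [rows[p] for p in index_pos]
```

For example, asking for b"kiwi" seeks to b"grape" via the index and returns 
b"grape" as closest-before and b"melon" as closest-after.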
  
@@ -35, +35 @@

  If the index included an offset to every key, we would be able to use it to 
figure if the file had an entry for the queried key, and every index lookup 
would get us the exact offset.  But such an index would be too large to keep 
in memory (if values are small, a file could have many entries.  Files are 
usually about 64MB but can grow to an upper-bound of about 1G, though this is 
configurable and there is nothing to stop it being configured up from this).
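  
  A back-of-the-envelope calculation shows the scale of the problem.  The key 
and value sizes below are assumptions picked for illustration (the page gives 
none), and compression and file framing overhead are ignored; only the 1G 
upper bound and the interval of 32 come from the text.

```python
FILE_BYTES = 1 << 30   # ~1G upper bound on store file size, from the text
KEY_BYTES = 40         # assumed average serialized key size
VALUE_BYTES = 10       # assumed average (small) value size
OFFSET_BYTES = 8       # one long per index entry
INTERVAL = 32          # hbase default index interval

# Number of key/value entries that fit in the file.
entries = FILE_BYTES // (KEY_BYTES + VALUE_BYTES)

# An index with an entry for every key versus one for every Nth key.
full_index_bytes = entries * (KEY_BYTES + OFFSET_BYTES)
sparse_index_bytes = (entries // INTERVAL) * (KEY_BYTES + OFFSET_BYTES)

print(f"entries:      {entries:,}")
print(f"full index:   {full_index_bytes / 2**20:,.0f} MiB")
print(f"sparse index: {sparse_index_bytes / 2**20:,.0f} MiB")
```

Under these assumptions the full index is nearly as large as the data file 
itself, while the every-32nd-key index is about 32 times smaller.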
  
  == New Format ==
-  * [https://issues.apache.org/jira/browse/HBASE-647 HBASE-647]: Have data, 
metadata, indices and bloomfilters, etc., all rolled up in the one file.  Could 
do this with [https://issues.apache.org/jira/browse/HADOOP-3315 TFile].  
SequenceFile allows addition of metadata but this facility is not exposed in 
MapFile.  Could add to MapFile but SequenceFile metadata is stored in the head 
of the SequenceFile.  Many metadata are known only after the flush: 
count-of-entries, bloomfilter, etc.
+  * [https://issues.apache.org/jira/browse/HBASE-647 HBASE-647]: Have data, 
metadata, indices and bloomfilters, etc., all rolled up in the one file.  Could 
do this with [https://issues.apache.org/jira/browse/HADOOP-3315 TFile].  
SequenceFile allows addition of metadata but this facility is not exposed in 
!MapFile.  Could add to !MapFile but SequenceFile metadata is stored in the 
head of the SequenceFile.  Many metadata are known only after the flush: 
count-of-entries, bloomfilter, etc.
   * [https://issues.apache.org/jira/browse/HBASE-519 Convert HStore to use 
only new interface methods].  If an Interface, can try different 
implementations.
   * In-memory: TFile has user supply the underlying data stream.  Could supply 
a stream hosted in  memory.
   * Always-on general bloomfilter. We know how many entries a file will have 
when we go to flush it so we can optimally size a bloomfilter.  The small 
amount of memory a bloomfilter occupies will pay for itself many-fold in the 
seeks saved trying to figure if a file contains an asked-for key.
   * Optimal random-access
-  * Iterate over keys only, rather than always over key+values as MapFile 
currently does.  This'd be useful when trying to find closest. TFile and 
SequenceFile can do this (it's not exposed in MapFile).
+  * Iterate over keys only, rather than always over key+values as !MapFile 
currently does.  This'd be useful when trying to find closest. TFile and 
SequenceFile can do this (it's not exposed in !MapFile).
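  
  To illustrate the always-on bloomfilter item in the list above: when the 
entry count is known at flush time, the textbook sizing formulas give the bit 
count m = -n ln(p) / (ln 2)^2 and hash count k = (m/n) ln 2 for n entries and 
a target false-positive rate p.  The sketch below is a generic bloomfilter, 
not the hbase implementation; the `BloomFilter` class, the 1% target rate and 
the SHA-256 double hashing are all assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Generic bloomfilter sized from a known entry count (illustrative)."""

    def __init__(self, expected_entries, false_positive_rate=0.01):
        n = max(1, expected_entries)
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        self.num_bits = max(1, math.ceil(
            -n * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes):
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


keys = [f"row{i}".encode() for i in range(10000)]
bloom = BloomFilter(len(keys))  # sized at "flush time" from the entry count
for k in keys:
    bloom.add(k)
```

For 10,000 entries at a 1% target this comes to roughly 12KB of bits, which 
is the kind of small memory cost the item above has in mind.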
  
  === Index ===
- TODO, but the TFile block-based index rather than the MapFile interval-based 
one would seem better for us: indices are then of predictable size, and a seek 
to the index position will load at an amenable spot when blocks are compressed. 
+ TODO, but the TFile block-based index rather than the !MapFile interval-based 
one would seem better for us: indices are then of predictable size, and a seek 
to the index position will load at an amenable spot when blocks are compressed. 
  
  === Nice-to-haves ===
   * Don't write out the family portion of column when writing keys.
  
  == Other File Formats ==
  
- Cassandra uses a Sequence File.  It adds key/values in blocks of 128 by 
default.  On the 128th entry, an index for the block keys is inlined and then a 
new block begins.  Block offsets are kept out in an index file as in MapFile.  
Bloomfilters are on by default.
+ Cassandra uses a Sequence File.  It adds key/values in blocks of 128 by 
default.  On the 128th entry, an index for the block keys is inlined and then a 
new block begins.  Block offsets are kept out in an index file as in !MapFile.  
Bloomfilters are on by default.
  
  From the bigtable paper, an SSTable "... contains a sequence of blocks 
(typically each block is 64KB in size, but this is configurable).  A block 
index (stored at the end of the SSTable) is used to locate blocks; the index is 
loaded into memory when the SSTable is opened.  A lookup can be performed with 
a single disk seek: we first find the appropriate block by performing a binary 
search in the in-memory index, and then reading the appropriate block from 
disk.  Optionally, an SSTable can be completely mapped into memory, which 
allows us to perform lookups and scans without touching the disk."
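  
  The block-index lookup in the quote can be sketched as follows.  This is a 
toy model, not the SSTable format: blocks hold a fixed number of entries 
rather than a 64KB byte budget, the index records the last key of each block 
(an assumption; the paper does not say which key is stored), and the `reads` 
list stands in for disk seeks.

```python
import bisect

BLOCK_ENTRIES = 4  # stand-in for the 64KB block size in the quote


def build_blocks(entries, block_entries=BLOCK_ENTRIES):
    """Split sorted (key, value) pairs into blocks plus an in-memory index."""
    blocks = [entries[i:i + block_entries]
              for i in range(0, len(entries), block_entries)]
    block_index = [block[-1][0] for block in blocks]  # last key per block
    return blocks, block_index


def lookup(blocks, block_index, key, reads):
    # Binary search the in-memory index for the first block whose last
    # key is >= the queried key; only that block can hold the key.
    i = bisect.bisect_left(block_index, key)
    if i == len(blocks):
        return None  # past the end of the table, no block read needed
    reads.append(i)  # the single "disk seek"
    for k, v in blocks[i]:
        if k == key:
            return v
    return None


entries = [(f"k{i:03d}".encode(), i) for i in range(10)]
blocks, block_index = build_blocks(entries)
reads = []
value = lookup(blocks, block_index, b"k005", reads)
```

Here `value` is 5 and `reads` records exactly one block read, matching the 
single-seek claim in the quote.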
