[Hadoop Wiki] Trivial Update of "Hbase/NewFileFormat" by stack

Apache Wiki Fri, 03 Oct 2008 12:50:07 -0700

The following page has been changed by stack:
http://wiki.apache.org/hadoop/Hbase/NewFileFormat

------------------------------------------------------------------------------
- This page is for discussion related to 
[https://issues.apache.org/jira/browse/HBASE-61 HBASE-61, Create an 
HBase-specific MapFile implementation].  That issue, and its linked issues, has 
a bunch of suggestions for how we might do a better persistence.  Other related 
issues include, [https://issues.apache.org/jira/browse/HADOOP-3315 TFile], and 
[https://issues.apache.org/jira/browse/HBASE-647 HBASE-647, Remove the 
HStoreFile 'info' file (and index and bloomfilter if possible)].
+ This page is for discussion related to 
[https://issues.apache.org/jira/browse/HBASE-61 HBASE-61, Create an 
HBase-specific MapFile implementation].  That issue, and its linked issues, have 
a bunch of suggestions for how we might do better persistence.  Most have 
been replicated in the ''New Format'' section below.  Other related issues 
include [https://issues.apache.org/jira/browse/HADOOP-3315 TFile] and 
[https://issues.apache.org/jira/browse/HBASE-647 HBASE-647, Remove the 
HStoreFile 'info' file (and index and bloomfilter if possible)].
  
  == Current Implementation ==
  
@@ -35, +35 @@

  If the index included an offset for every key, we could use it to tell 
whether a file has an entry for the queried key, and every index lookup would 
give us the exact offset.  But such an index would be too large to keep in 
memory: if values are small, a file could have many entries.  Files are usually 
about 64MB but can grow to an upper bound of about 1G; this limit is 
configurable, and nothing stops it being configured higher.
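
To put rough numbers on the index-size concern above, here is a minimal back-of-the-envelope sketch.  The class name, the 100-byte average entry, the 40-byte average key and the sampling interval of 128 are all illustrative assumptions, not figures from this page.

```java
// Rough estimate only: entry, key and interval sizes are illustrative assumptions.
final class IndexSizeEstimate {
    /** Bytes needed to keep one (key, offset) pair in memory for every interval-th entry. */
    static long indexBytes(long fileBytes, int avgEntryBytes, int avgKeyBytes, int interval) {
        long entries = fileBytes / avgEntryBytes;            // entries in the file
        long indexed = (entries + interval - 1) / interval;  // ceil(entries / interval)
        return indexed * (avgKeyBytes + Long.BYTES);         // key bytes plus an 8-byte offset
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        // Index every key versus every 128th key in a 1G file of small entries.
        System.out.println("every key:       " + indexBytes(oneGb, 100, 40, 1) + " bytes");
        System.out.println("every 128th key: " + indexBytes(oneGb, 100, 40, 128) + " bytes");
    }
}
```

Under these assumptions a full index for a 1G file comes to roughly half a gigabyte, while sampling every 128th key needs about 4MB, which is why a sparse index plus a short scan is the usual compromise.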
  
  == New Format ==
-  * [https://issues.apache.org/jira/browse/HBASE-647 HBASE-647]: Have data, 
metadata, indices and bloomfilters, etc., all rolled up in the one file.  Could 
do this with [https://issues.apache.org/jira/browse/HADOOP-3315 TFile].
+  * [https://issues.apache.org/jira/browse/HBASE-647 HBASE-647]: Have data, 
metadata, indices, bloomfilters, etc., all rolled up in one file.  Could 
do this with [https://issues.apache.org/jira/browse/HADOOP-3315 TFile].  
SequenceFile allows addition of metadata but this facility is not exposed in 
MapFile.  Could add it to MapFile, but SequenceFile metadata is stored at the 
head of the SequenceFile, and much of our metadata is known only after the 
flush: count-of-entries, bloomfilter, etc.
+  * [https://issues.apache.org/jira/browse/HBASE-519 Convert HStore to use 
only new interface methods].  If HStore works against an Interface, we can try 
different implementations.
+  * In-memory: TFile has the user supply the underlying data stream.  Could 
supply a stream hosted in memory.
+  * Always-on general bloomfilter.  We know how many entries a file will have 
when we go to flush it, so we can optimally size a bloomfilter.  The small 
amount of memory a bloomfilter occupies will pay for itself many-fold in the 
seeks saved when trying to figure out whether a file contains a requested key.
+  * Optimal random-access
+  * Iterate over keys only, rather than MapFile's current behavior of always 
reading key+values.  This would be useful when trying to find the closest key.  
TFile and SequenceFile can do this (it's not exposed in MapFile).
+  
+ 
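
As an illustration of the bloomfilter bullet above: once the entry count n is known at flush time, the standard sizing formulas give the bit count m = -n ln(p) / (ln 2)^2 and hash count k = (m / n) ln 2 for a target false-positive rate p.  This is a minimal sketch; the class and method names are hypothetical and the 1% rate is an assumed target, not something this page specifies.

```java
// Sketch of sizing a bloomfilter at flush time; names and the 1% target are assumptions.
final class BloomSizing {
    /** Optimal number of bits for n entries at false-positive rate p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Optimal number of hash functions for n entries in m bits (at least one). */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;              // entry count, known when the flush starts
        long m = optimalBits(n, 0.01);    // bits for an assumed 1% false-positive rate
        System.out.println(m / 8 / 1024 + " KB, " + optimalHashes(n, m) + " hash functions");
    }
}
```

For a million entries at 1% this works out to a little under 10 bits per entry, which is small next to the cost of a wasted seek into a file that does not hold the key.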
+ === Nice-to-haves ===
+  * Don't write out the family portion of column when writing keys.
  
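
On the point in the New Format list that much metadata is known only after the flush: one way around head-of-file metadata is to write it after the data and finish with a fixed-size trailer that points back at it.  The sketch below is a hypothetical layout for illustration only; it is not the TFile or SequenceFile format, and the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical layout: [records][metadata: entry count][trailer: offset of metadata].
final class TrailerSketch {
    /** Writes the records, then the metadata, then an 8-byte trailer holding the metadata offset. */
    static byte[] write(String[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String v : values) {
                out.writeUTF(v);              // data section
            }
            long metaOffset = out.size();     // where the metadata begins
            out.writeLong(values.length);     // metadata known only once all data is written
            out.writeLong(metaOffset);        // fixed-size trailer: last 8 bytes of the file
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the entry count by following the trailer; never touches the data section. */
    static long readEntryCount(byte[] file) {
        long metaOffset = ByteBuffer.wrap(file, file.length - Long.BYTES, Long.BYTES).getLong();
        return ByteBuffer.wrap(file, (int) metaOffset, Long.BYTES).getLong();
    }

    public static void main(String[] args) {
        byte[] file = write(new String[] {"a", "bb", "ccc"});
        System.out.println("entries: " + readEntryCount(file));
    }
}
```

A reader opens the file, reads the last 8 bytes, and jumps straight to the metadata; a bloomfilter or index could sit in the same tail section.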
  == Other File Formats ==
  Cassandra uses a Sequence File.  It adds key/values in blocks of 128 by 
default.  On the 128th entry, an index for the block keys is inlined and then a 
new block begins.  Block offsets are kept out in an index file as in MapFile.  
Bloomfilters are on by default.
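
The block layout described above can be sketched roughly as follows.  This is a guess at the shape from the description here, not Cassandra's actual code: the class name, the byte encoding, and the in-memory list standing in for the separate index file are all assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Sketch only: entries are written in blocks, each block ends with an inline key index,
// and block start offsets are collected separately (standing in for the index file).
final class BlockedWriter {
    private final int entriesPerBlock;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream data = new DataOutputStream(buffer);
    private final List<Long> blockOffsets = new ArrayList<>();
    private final List<String> keys = new ArrayList<>();        // keys in the open block
    private final List<Integer> positions = new ArrayList<>();  // entry offsets within the block
    private int blockStart;

    BlockedWriter(int entriesPerBlock) {
        this.entriesPerBlock = entriesPerBlock;
    }

    void append(String key, byte[] value) {
        try {
            if (keys.isEmpty()) {                    // first entry opens a new block
                blockStart = data.size();
                blockOffsets.add((long) blockStart);
            }
            keys.add(key);
            positions.add(data.size() - blockStart);
            data.writeUTF(key);
            data.writeInt(value.length);
            data.write(value);
            if (keys.size() == entriesPerBlock) {    // block full: inline its index
                writeBlockIndex();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the index for a trailing partial block, if there is one. */
    void close() {
        try {
            if (!keys.isEmpty()) {
                writeBlockIndex();
            }
            data.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void writeBlockIndex() throws IOException {
        data.writeInt(keys.size());
        for (int i = 0; i < keys.size(); i++) {
            data.writeUTF(keys.get(i));
            data.writeInt(positions.get(i));
        }
        keys.clear();
        positions.clear();
    }

    List<Long> blockOffsets() {
        return blockOffsets;
    }

    int bytesWritten() {
        return data.size();
    }
}
```

A caller would construct it with `new BlockedWriter(128)` to match the default block size mentioned above, call `append` for each key/value in sorted order, then `close`.  A lookup would pick a block from the offsets list and then consult only that block's inline index.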
