Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by stack:
http://wiki.apache.org/hadoop/Hbase/NewFileFormat

New page:
This page is for discussion related to 
[https://issues.apache.org/jira/browse/HBASE-61 HBASE-61, Create an 
HBase-specific MapFile implementation].  That issue and its linked issues have 
a bunch of suggestions for how we might do better persistence.  Other related 
issues include [https://issues.apache.org/jira/browse/HADOOP-3315 TFile] and 
[https://issues.apache.org/jira/browse/HBASE-647 HBASE-647, Remove the 
HStoreFile 'info' file (and index and bloomfilter if possible)].

== Current Implementation ==

Currently -- circa 0.19.0 -- hbase store files are built on 
''org.apache.hadoop.io.MapFile''. A MapFile is made of two 
''org.apache.hadoop.io.SequenceFile''s: a sorted data file of key/values and 
an accompanying index file.  Once written, neither the data file nor the 
index file changes.  The current index is 'flat': keys and their offsets.  An 
index entry is made for every Nth entry of the data file, where N is 
configurable with a default of 32 in hbase (it's 128 in hadoop).
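
For reference, a minimal sketch of writing and reading such a file, assuming 
the stock 0.19-era MapFile API (Writer/Reader constructors, setIndexInterval, 
append, get); the path and key/value choices are just for illustration:

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

/**
 * Sketch of MapFile usage as described above; not hbase code.
 */
public class MapFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/tmp/example-mapfile";   // hypothetical path

    // Write: keys must be appended in sorted order.  Every 32nd key goes
    // into the index (hbase's default; hadoop's default is 128).
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
    writer.setIndexInterval(32);
    writer.append(new Text("row1"), new Text("value1"));
    writer.append(new Text("row2"), new Text("value2"));
    writer.close();

    // Read: the index is loaded into memory; get() binary-searches it and
    // then scans the data file forward from the nearest indexed key.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    reader.get(new Text("row2"), value);
    System.out.println(value);
    reader.close();
  }
}
}}}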

MapFiles can be configured to compress each key/value entry individually or 
to compress in blocks of a configured size.  Blocks do not span key/values; 
they break on entry boundaries.

Hbase keys are made of row/column/timestamp.  Rows and columns are effectively 
binary.  The timestamp is a long.  The sort is not a straightforward binary 
sort; it has idiosyncrasies embodied in the particular Comparator passed when 
creating the store file: e.g. timestamps are sorted in reverse order because 
we want to find the newest entry first.
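
For illustration, here is a minimal sketch of a comparator embodying the sort 
just described -- rows and columns compare as unsigned bytes, timestamps 
compare newest-first.  The class and field names are hypothetical; this is 
not the actual hbase Comparator.

{{{
import java.util.Comparator;

/** Hypothetical sketch of the store-file sort order described above. */
public class StoreKeyComparator implements Comparator<StoreKeyComparator.Key> {

  /** Hypothetical key: row and column are opaque bytes, timestamp is a long. */
  public static class Key {
    byte[] row;
    byte[] column;
    long timestamp;
  }

  @Override
  public int compare(Key a, Key b) {
    int c = compareBytes(a.row, b.row);            // rows: plain binary order
    if (c != 0) return c;
    c = compareBytes(a.column, b.column);          // columns: plain binary order
    if (c != 0) return c;
    return Long.compare(b.timestamp, a.timestamp); // timestamps: newest first
  }

  private static int compareBytes(byte[] x, byte[] y) {
    int len = Math.min(x.length, y.length);
    for (int i = 0; i < len; i++) {
      int c = (x[i] & 0xff) - (y[i] & 0xff);       // unsigned byte comparison
      if (c != 0) return c;
    }
    return x.length - y.length;
  }
}
}}}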

Every hbase flush creates a new MapFile in the file system and an accompanying 
SequenceFile of metadata, an 'info' file.  The metadata includes the id of the 
last edit added to the MapFile and, if the store file is a 'reference' file -- 
more on this later (TODO) -- info on what is referred to.

Optionally administrators can enable bloomfilters on hbase stores.  The 
bloomfilter allows a fast test of whether or not the store file contains an 
entry.  The bloomfilter is persisted into the filesystem in its own file.

Worst case, each flush writes '''four''' files to the file system: a mapfile 
data file, the mapfile index, an accompanying 'info' file for metadata, and a 
bloomfilter file.

Currently, on open of a store file, the index is read into memory and then the 
index file is closed.  The data file is opened and kept open to avoid paying 
the 'open' cost on every random access.  This latter practice means hbase soon 
trips the 'too many open files' exception (See 
[http://wiki.apache.org/hadoop/Hbase/FAQ#6 FAQ #6]).  The info file is opened, 
read, and then closed.  If there is a bloomfilter, it's deserialized out of 
the bloomfilter file.

== Common Index-based Accesses ==

To look up a particular key, a query is first made against the MapFile index 
to find the nearest key using a binary search.  We then go to the data file, 
seek to the indexed offset, and iterate until we find the queried key or we've 
moved past where it should have been in the file.
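
A minimal, self-contained sketch of that lookup against a flat index.  Keys 
are plain Strings and the 'data file' is an in-memory array purely to show the 
binary-search-then-scan pattern; none of this is the MapFile code itself.

{{{
/**
 * Sketch of the flat-index lookup described above.
 */
public class FlatIndexLookup {

  static class IndexEntry {          // one entry per Nth key in the data file
    final String key;
    final int offset;                // position of that key in the data file
    IndexEntry(String key, int offset) { this.key = key; this.offset = offset; }
  }

  /** Find the value for key, or null if the file has no such entry. */
  static String get(String[] dataKeys, String[] dataValues,
                    IndexEntry[] index, String key) {
    // Binary search the index for the greatest indexed key <= the target.
    int lo = 0, hi = index.length - 1, start = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (index[mid].key.compareTo(key) <= 0) {
        start = index[mid].offset;   // candidate; keep looking to the right
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    // "Seek" to that offset and scan forward until we find or pass the key.
    for (int i = start; i < dataKeys.length; i++) {
      int c = dataKeys[i].compareTo(key);
      if (c == 0) return dataValues[i];   // exact match
      if (c > 0) return null;             // moved past where it would be
    }
    return null;
  }

  public static void main(String[] args) {
    String[] keys = {"a", "c", "e", "g", "i"};
    String[] vals = {"1", "2", "3", "4", "5"};
    IndexEntry[] idx = {
        new IndexEntry("a", 0), new IndexEntry("e", 2), new IndexEntry("i", 4) };
    System.out.println(get(keys, vals, idx, "g"));   // prints 4
    System.out.println(get(keys, vals, idx, "f"));   // prints null
  }
}
}}}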

Another common access pattern has us asking for the row that falls closest to 
the one we asked for, both closest-before and closest-after (if not an exact 
match).  To find the closest row, we go to the index first and then iterate 
forward.

We also need to be able to figure out quickly whether a store file has an 
entry at all for a particular key.

== File Index ==

We need to fix the case where rows have many entries.  When scanning, we 
pre-populate the scanner with the scanner start row (using the index to figure 
the offset).  On a call to hasNext, we then iterate forward until we trip over 
the next row.  This works fine if there are fewer than tens of entries per 
row, but if there are millions the hbase scanner crawls (or the client lease 
just times out).  At the least we need to be smarter about our use of the flat 
index, going back to it to try to figure out whether a row has tens or 
millions of entries -- or the index could record the offset of the start of 
every row.
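
A toy illustration of the crawl, with an in-memory array standing in for the 
store file (all names hypothetical).  Advancing to the next row walks every 
entry of the current row; if the index recorded the offset of the start of 
every row, this loop could be replaced by a seek.

{{{
import java.util.Arrays;

/** Illustration of the wide-row crawl described above. */
public class NextRowCrawl {

  /** Return the position of the first entry of the row after rows[current]. */
  static int nextRowStart(byte[][] rows, int current) {
    byte[] currentRow = rows[current];
    int i = current;
    while (i < rows.length && Arrays.equals(rows[i], currentRow)) {
      i++;                                 // one step per entry in the row
    }
    return i;
  }

  public static void main(String[] args) {
    byte[][] rows = { "a".getBytes(), "a".getBytes(), "a".getBytes(),
                      "b".getBytes() };
    System.out.println(nextRowStart(rows, 0));  // prints 3: walked the whole 'a' row
  }
}
}}}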

The index needs to be small.  There are lots of these store files in hbase.  
Currently we open the index, read it into memory, then close the index file, 
but keep the data file open for 'fast' random access. One improvement would be 
to divide the index into pieces -- per-file-block indices, as in TFile or 
cassandra, would make most sense -- and optionally let go of 
least-recently-used block indices under memory pressure.
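
A sketch of that idea using a plain access-ordered LinkedHashMap as the LRU: 
only a bounded number of per-block indices stay resident, and evicted ones are 
re-read from the file on demand.  BlockIndex and loadBlockIndex() are 
hypothetical placeholders.

{{{
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of letting go of least-recently-used block indices. */
public class BlockIndexCache {

  /** Hypothetical per-block index: the keys and offsets inside one file block. */
  static class BlockIndex { /* keys and offsets for one block */ }

  private final int maxResident;
  private final Map<Integer, BlockIndex> cache;

  public BlockIndexCache(int maxResident) {
    this.maxResident = maxResident;
    // Access-ordered LinkedHashMap evicts the least-recently-used block index
    // once more than maxResident of them are resident.
    this.cache = new LinkedHashMap<Integer, BlockIndex>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Integer, BlockIndex> eldest) {
        return size() > BlockIndexCache.this.maxResident;
      }
    };
  }

  /** Return the index for the given block, reloading it if it was let go. */
  public BlockIndex indexFor(int blockId) {
    BlockIndex idx = cache.get(blockId);
    if (idx == null) {
      idx = loadBlockIndex(blockId);   // hypothetical: re-read from the file
      cache.put(blockId, idx);
    }
    return idx;
  }

  private BlockIndex loadBlockIndex(int blockId) {
    return new BlockIndex();           // stub for illustration
  }
}
}}}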

If the index included the offset of every key, we would be able to use it to 
figure out whether the file had an entry for the queried key, and every index 
lookup would give us the exact offset.  But such an index would be too large 
to keep in memory (if values are small, a file could have many entries; files 
are usually about 64MB but can grow to an upper bound of about 1G, though this 
is configurable and there is nothing to stop it being configured up from 
this).

== Other File Formats ==
Cassandra uses a SequenceFile.  It adds key/values in blocks of 128 entries by 
default.  On the 128th entry, an index of the block's keys is inlined and then 
a new block begins.  Block offsets are kept in a separate index file, as in 
MapFile.  Bloomfilters are on by default.
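
A rough sketch of that layout as this page describes it (an interpretation, 
not Cassandra's actual code):

{{{
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Key/values are appended; after every BLOCK_SIZE entries an index of that
 * block's keys is written inline and a new block begins.  Block start offsets
 * would go to a separate index file, as in MapFile; here they are just
 * collected in a list.
 */
public class BlockedWriter {
  static final int BLOCK_SIZE = 128;                  // entries per block

  private final DataOutputStream out;
  private final List<Long> blockOffsets = new ArrayList<Long>();   // -> index file
  private final List<String> blockKeys = new ArrayList<String>();  // current block's keys

  public BlockedWriter(DataOutputStream out) {
    this.out = out;
  }

  public void append(String key, byte[] value) throws IOException {
    if (blockKeys.isEmpty()) {
      blockOffsets.add((long) out.size());            // this block's start offset
    }
    out.writeUTF(key);                                // entry: key, value length, value
    out.writeInt(value.length);
    out.write(value);
    blockKeys.add(key);
    if (blockKeys.size() == BLOCK_SIZE) {
      out.writeInt(blockKeys.size());                 // inline index of this block's keys
      for (String k : blockKeys) {
        out.writeUTF(k);
      }
      blockKeys.clear();                              // then a new block begins
    }
  }
}
}}}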

== New Format ==
Have data, metadata, indices, bloomfilters, etc., all rolled up in the one 
file.
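
One possible way to lay such a file out -- offered only as a sketch for 
discussion -- is a series of sections followed by a fixed-size trailer that 
records where each section starts:

{{{
import java.io.DataOutput;
import java.io.IOException;

/**
 * One possible single-file layout (a sketch, not a settled design):
 *
 *   +--------------------+
 *   | data blocks ...    |  key/values, optionally block-compressed
 *   +--------------------+
 *   | block index        |  first key + offset per data block
 *   +--------------------+
 *   | bloomfilter        |  optional
 *   +--------------------+
 *   | metadata ('info')  |  e.g. id of last edit, reference info
 *   +--------------------+
 *   | fixed-size trailer |  offsets of the sections above, written last
 *   +--------------------+
 *
 * A reader seeks to the end of the file, reads the fixed-size trailer, and
 * from it finds every other section -- so one open file replaces the four
 * files described earlier.
 */
public class FileTrailer {
  long blockIndexOffset;   // where the block index section starts
  long bloomFilterOffset;  // 0 if no bloomfilter was written
  long metadataOffset;     // where the 'info' metadata section starts

  /** Written as the very last bytes of the file, so its position is known. */
  void write(DataOutput out) throws IOException {
    out.writeLong(blockIndexOffset);
    out.writeLong(bloomFilterOffset);
    out.writeLong(metadataOffset);
  }
}
}}}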
