HBase Design Ideas, Part I

Michael Cafarella Sun, 14 May 2006 15:00:27 -0700

Hi everyone,

I've written up a design that I've been working on for a little bit, for
a project I'll call "HBase".  The idea is for Hadoop to implement something
similar in spirit to BigTable.  That is, a distributed data store that
places a greater emphasis on scalability than on SQL compatibility
or traditional transactional correctness.


BigTable is neither completely described anywhere, nor is it
necessarily exactly what we want.  So I'm not trying to clone BigTable,
but I am going to draw on it a lot.

My personal view is that BigTable is a great "physical layer" but not yet
a great database system.  A major thing it lacks is a good query language.
Another, freely admitted by the Google people, is any kind of inter-row
locking.  I'm not going to try to solve all these problems, but I would
like HBase to be extendible enough that it's easy to add new query
languages or primitives.

In this mail, I'll describe a system that's pretty similar to BigTable.
I'll send a second one that describes what we might want to change
or add.

Please let me know what you think!

Thanks,
--Mike

--------------------------------------------------------------------------------
I.  Table semantics

An HBase consists of one or more HTables.  An HTable is a list of rows,
sorted alphabetically by "row name".  An HTable also has a series of
"columns."  A row may or may not contain a value for a column.  The
HTable representation is sparse, so if a row does not contain a value
for a given column, there is no storage overhead.

(Thus, there's not really a "schema" to an HTable.  Every operation, even
adding a column, is considered a row-centric operation.)

The "current version" of a row is always available, timestamped with its
last modification date.  The system may also store previous versions of a row,
according to how the HTable is configured.

Updates to a single row are always atomic and can affect one or more columns.

II.  System layout

HTables are partitionable into contiguous row regions called HRegions.
All machines in a pool run an HRegionServer.  A given HRegion is served
to clients by a single HRegionServer.  A single HRegionServer may be
responsible for many HRegions.  The HRegions for a single HTable will
be scattered across arbitrary HRegionServers.

When a client wants to add/delete/update a row value, it must locate the
relevant HRegionServer.  It then contacts the HRegionServer and communicates
the updates.  There may be other steps, mainly lock-oriented ones.  But locating
the relevant HRegionServers is a bare minimum.

The HBase system can repartition an HTable at any time.  For example, many
repeated inserts at a single location may cause a single HRegion to grow
very large.  The HBase would then try to split that into multiple HRegions.
Those HRegions may be served by the same HRegionServer as the
original or may be served by a different one.

Each HRegionServer sends a regular heartbeat to an HBaseMaster machine.
If the heartbeat for an HRegionServer fails, then the HBaseMaster is responsible
for reassigning its HRegions to other available HRegionServers.

All HRegions are stored within DFS, so the HRegion is always available, even
in the face of machine failures.  The HRegionServers and DFS DataNodes run
on the same set of machines.  We would like for an HRegionServer to always
serve data stored locally, but that is not guaranteed when using DFS.  We can
encourage it by:
1) In the event of an insert-motivated HRegion move, the new HRegionServer
should always create a new DFS file for the new HRegion.  The DFS rules of
thumb will allocate the chunks locally for the HRegionServer.
2) In the even of a machine failure, we cannot do anything similar to above.
Instead, the HBaseMaster can ask DFS for hints as to where the relevant
file blocks are stored.  If possible, it will allocate the new
HRegions to servers
that physically contain the HRegion.
3) If necessary, we could add an API to DFS that demands block replication
to a given node.  I'd like to avoid this if possible.

The mapping from row to HRegion (and hence, to HRegionServer) is itself
stored in a special HTable.  The HBaseMaster is the only client allowed to
write to this HTable.  This special HTable may itself be split into several
HRegions.  However, we only allow a hard-coded number of split-levels.
The top level of this hierarchy must be easily-stored on a single machine.
That top-level table is always served by the HBaseMaster itself.

III.  Client behavior

Let's think about what happens when a client wants to add a row.
1) The client must compute what HRegion is responsible for the key
it wants to insert into the HTable.  It must navigate the row->HRegion
mapping, which is stored in an HTable.

So the client first contacts the HBaseMaster for the top-level table contents.
It then steps downward through the table set, until it finds the mapping for
the target row.

2) The client contacts the HRegionServer responsible for the target row,
and asks to insert.  If the HRegionServer is no longer responsible for the
relevant HRegion, it returns a failure message and tells the client to go
back to step 1 to find the new correct HRegionServer.

If the HRegionServer is the right place to go, it accepts the new row from
the client.  The HRegionServer guarantees that the insert is atomic; it
will not intermingle the insert with a competing insert for the same row key.
When the row is stored, the HRegionServer includes version and timestamp
information.

3) That's it!

IV The HRegionServer

Maintaining the data for a single HRegion is slightly complicated.  It's
especially weird given the write-once semantics of DFS.  There are
three important moving parts:

1) HBackedStore is a file-backed store for rows and their values.
It is never edited in place.  It has B-Tree-like lookups for finding
a row quickly.  HBackedStore is actually a series of on-disk stores,
each store being tuned for a certain object size.  Thus, all the "small"
(in bytes) values for a row live within the same file, all the medium
ones live in a separate file, etc.  There is only one HBackedStore
for any single HRegion.

2) HUpdateLog is a log of updates to the HBackedStore.  It is backed
by an on-disk file.  When making reads from the HBackedStore, it may
be necessary to consult the HUpdateLog to see if any more-recent
updates have been made.  There may be a series of HUpdateLogs
for a single HRegion.

3) HUpdateBuf is an in-memory version of HUpdateLog.  It, too, needs
to be consulted whenever performing a read.  There is only one
HUpdateBuf for a single HRegion.

Any incoming edit is first made directly to the HUpdateBuf.  Changes
made to the HUpdateBuf are volatile until flushed to an HUpdateLog.
The rate of flushes is an admin-configurable parameter.

Periodically, the HBackedStore and the series of current HUpdateLogs
are merged to form a new HBackedStore.  At that point, the old HUpdateLog
objects can be destroyed.  During this compaction process, edits are
made to the HUpdateBuf.

HBase Design Ideas, Part I

Reply via email to