Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hbase/DesignOverview" page has been changed by DougMeil.
The comment on this change is: Per stack, this page is now directing readers to the HBase book.
http://wiki.apache.org/hadoop/Hbase/DesignOverview?action=diff&rev1=21&rev2=22

--------------------------------------------------

- '''This page was created on 06.03.09 and is still under construction.'''
- = Table of Contents =
  
+ The HBase design overview can now be found in the HBase book at [[http://hbase.apache.org/book.html#datamodel]].
-  * [[#intro|Introduction]]
-  * [[#datamodel|Data Model]]
-   * [[#conceptual|Conceptual View]]
-   * [[#internal|Internal View]]
-  * [[#api|API]]
-  * [[#design|Architecture Design]]
-   * [[#master|HBaseMaster]]
-   * [[#hregionserv|HRegionServer]]
-   * [[#client|HBase Client]]
-  * [[#impl|Implementation]]
- <<Anchor(intro)>>
- = Introduction =
  
- This paper is an HBase-oriented analogue of the Google [[http://labs.google.com/papers/bigtable.html|Bigtable paper]], and is intended to be self-sufficient.
- 
- HBase is an [[http://apache.org/|Apache]] open source project whose goal is to provide Bigtable-like storage for the Hadoop Distributed Computing Environment. HBase leverages the distributed data storage provided by the [[http://hadoop.apache.org/core/docs/current/hdfs_design.html|Hadoop Distributed File System (HDFS)]] and uses [[http://hadoop.apache.org/zookeeper/docs/current/zookeeperOver.html|ZooKeeper]] for coordination between HBase nodes.
- 
- Data is logically organized into tables, rows and columns. An iterator-like 
interface is available for scanning through a row range and, of course, there 
is the ability to retrieve a column value for a specific row key. Any 
particular column may have multiple versions for the same row key.
- 
- <<Anchor(datamodel)>>
- = Data Model =
- 
- Applications store data rows in labeled tables. A data row has a sortable row 
key and an arbitrary number of columns. The table is stored sparsely, so that 
rows in the same table can have widely varying numbers of columns.
- 
- An HBase table is a three-dimensional sorted map: it maps the Cartesian product of row key, column key, and timestamp to a cell value:
- 
- '''(row:byte[] x column:byte[] x timestamp:Long) -> byte[]'''
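- 
- To make the three-dimensional mapping concrete, the following is a minimal model in plain Java (illustrative only, not HBase code; String keys stand in for the byte arrays above):
- {{{
import java.util.Collections;
import java.util.TreeMap;

public class DataModelSketch {
  public static void main(String[] args) {
    // Illustrative model only, not HBase code: a table behaves like a
    // three-level sorted map, row -> column -> timestamp -> value.
    TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> table = new TreeMap<>();
    // Timestamps sort in descending order so the newest version comes first.
    table.computeIfAbsent("com.cnn.www", r -> new TreeMap<>())
         .computeIfAbsent("anchor:cnnsi.com", c -> new TreeMap<>(Collections.reverseOrder()))
         .put(9L, "CNN".getBytes());
    // The most recent version of a cell is the first entry of the inner map.
    byte[] newest = table.get("com.cnn.www").get("anchor:cnnsi.com").firstEntry().getValue();
    System.out.println(new String(newest)); // prints "CNN"
  }
}
}}}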
- 
- A column name has the form ''"<family>:<label>"'' where <family> and <label> can be arbitrary byte arrays. A table enforces its set of <family>s (called ''"column families"''). Adjusting the set of families is done by performing administrative operations on the table. However, new <label>s can be used in any write operation without pre-announcing them. HBase stores column families physically close together on disk, so the items in a given column family should have roughly the same read/write characteristics and contain similar data.
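- 
- As a sketch of these rules, using the 0.19-era client API that this page's Javadoc links refer to (class and method names vary across HBase versions; the "webtable" table and "anchor:" family are hypothetical):
- {{{
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class ColumnFamilyExample {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // Column families are fixed by administrative operations on the table...
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("webtable");
    desc.addFamily(new HColumnDescriptor("anchor:"));
    admin.createTable(desc);
    // ...but any <label> within a family may be written without pre-announcement.
    HTable table = new HTable(conf, "webtable");
    BatchUpdate update = new BatchUpdate("com.cnn.www");
    update.put("anchor:cnnsi.com", "CNN".getBytes());
    table.commit(update);
  }
}
}}}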
- 
- Only a single row at a time may be locked by default. Row writes are always 
atomic, but it is also possible to lock a single row and perform both read and 
write operations on that row atomically.
- 
- An extension was added recently to allow multi-row locking, but this is not 
the default behavior and must be explicitly enabled.
- 
- More details can be found at [[Hbase/DataModel|The HBase Data Model]].
- 
- <<Anchor(conceptual)>>
- == Conceptual View ==
- 
- Conceptually, a table may be thought of as a collection of rows that are located by a row key (and optional timestamp), and in which any column may lack a value for a particular row key (the table is sparse).
- 
- <<Anchor(datamodelexample)>>
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"contents:"'' 
||||<:> '''Column''' ''"anchor:"'' ||<:> '''Column''' ''"mime:"'' ||
- ||<^|5> "com.cnn.www" ||<:> t9 || ||<)> "anchor:cnnsi.com" ||<:> "CNN" || ||
- ||<:> t8 || ||<)> "anchor:my.look.ca" ||<:> "CNN.com" || ||
- ||<:> t6 ||<:> "<html>..." || || ||<:> "text/html" ||
- ||<:> t5 ||<:> "<html>..." || || || ||
- ||<:> t3 ||<:> "<html>..." || || || ||
- 
- <<Anchor(internal)>>
- == Internal View ==
- 
- Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis. This is an important consideration for schema and application designers to keep in mind.
- 
- Pictorially, the table shown in the [[#datamodelexample|conceptual view]] 
above would be stored as follows:
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"contents:"'' 
||
- ||<^|3> "com.cnn.www" ||<:> t6 ||<:> "<html>..." ||
- ||<:> t5 ||<:> "<html>..." ||
- ||<:> t3 ||<:> "<html>..." ||
- 
- <<BR>>
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' |||| '''Column''' ''"anchor:"'' ||
- ||<^|2> "com.cnn.www" ||<:> t9 ||<)> "anchor:cnnsi.com" ||<:> "CNN" ||
- ||<:> t8 ||<)> "anchor:my.look.ca" ||<:> "CNN.com" ||
- 
- <<BR>>
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"mime:"'' ||
- || "com.cnn.www" ||<:> t6 ||<:> "text/html" ||
- 
- <<BR>>
- 
- It is important to note in the diagram above that the empty cells shown in the conceptual view are not stored at all, since a column-oriented storage format has no need to store them. Thus a request for the value of the ''"contents:"'' column at time stamp t8 would return no value. Similarly, a request for an ''"anchor:my.look.ca"'' value at time stamp t9 would return no value.
- 
- However, if no timestamp is supplied, the most recent value for a particular column is returned; it is also the first one found, since timestamps are stored in descending order. Thus a request for the values of all columns in the row "com.cnn.www", with no timestamp specified, would return: the value of ''"contents:"'' from time stamp t6, the value of ''"anchor:cnnsi.com"'' from time stamp t9, the value of ''"anchor:my.look.ca"'' from time stamp t8, and the value of ''"mime:"'' from time stamp t6.
- 
- === Regions (Row Ranges) ===
- 
- To an application, a table appears to be a list of tuples sorted by row key ascending, column name ascending, and timestamp descending. Physically, tables are broken up into row ranges called ''regions''. Each row range contains rows from start-key (inclusive) to end-key (exclusive). A set of regions, sorted appropriately, forms an entire table. A row range is identified by the table name and start-key.
- 
- Each column family in a region is managed by a ''Store''. Each ''Store'' may have one or more ''!StoreFiles'' (a Hadoop HDFS file type). !StoreFiles are immutable once closed and are stored in the Hadoop HDFS. Other details:
-  * !StoreFiles cannot currently be mapped into memory.
-  * !StoreFiles maintain their sparse index in a separate file.
-  * HBase extends !StoreFiles so that a bloom filter can be employed to enhance negative lookup performance (see the sketch below). The hash function employed is one developed by Bob Jenkins.
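- 
- A toy illustration of why the bloom filter helps negative lookups; this is not the HBase implementation, and String.hashCode() merely stands in for the Bob Jenkins hash:
- {{{
import java.util.BitSet;

// Toy bloom filter: a definite "no" lets the reader skip the StoreFile entirely.
public class BloomSketch {
  private static final int SIZE = 1 << 16;
  private final BitSet bits = new BitSet(SIZE);

  // Set one bit per hash function for every key stored in the file.
  void add(String key) {
    bits.set(hash(key, 1));
    bits.set(hash(key, 2));
  }

  // false: the key is definitely absent, so the file read is skipped;
  // true: the key may be present and the file must still be consulted.
  boolean mightContain(String key) {
    return bits.get(hash(key, 1)) && bits.get(hash(key, 2));
  }

  private int hash(String key, int seed) {
    return ((seed + key).hashCode() & 0x7fffffff) % SIZE;
  }
}
}}}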
- 
- <<Anchor(api)>>
- = API =
- 
- == Client API ==
- 
- See the Javadoc for [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html|HTable]] and [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HBaseAdmin.html|HBaseAdmin]].
- 
- == Scanner API ==
- 
- To obtain a scanner, a Cursor-like row 'iterator' that must be closed, 
[[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#HTable(org.apache.hadoop.hbase.HBaseConfiguration,%20java.lang.String)|instantiate
 an HTable]], and then invoke ''getScanner''. This method returns a 
[[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html|Scanner]]
 against which you call 
[[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html#next()|next]]
 and ultimately 
[[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html#close()|close]].
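- 
- For example (a sketch against the 0.19-era classes linked above; the "webtable" table and "anchor:" family are hypothetical, and signatures vary across versions):
- {{{
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "webtable");
    // getScanner returns a cursor-like iterator over a row range.
    Scanner scanner = table.getScanner(new byte[][] { Bytes.toBytes("anchor:") });
    try {
      RowResult row;
      while ((row = scanner.next()) != null) {
        System.out.println(row); // one sorted row at a time
      }
    } finally {
      scanner.close(); // scanners hold server-side resources; always close them
    }
  }
}
}}}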
- 
- <<Anchor(design)>>
- = Architecture Design =
- 
- HBase has a multiple-client, multiple-server (client-server) architecture.
- 
- There are three major components of the HBase architecture:
-  1. The HMaster (HBase master server)
-  2. The H!RegionServer (HBase region server)
-  3. The HBase client, defined by org.apache.hadoop.hbase.client.HTable
- 
- Each will be discussed in the following sections.
- 
- <<Anchor(master)>>
- == HMaster ==
- 
- There is only one active HMaster for a single HBase deployment. To ensure that there is always one active HMaster, HBase uses a [[http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_leaderElection|leader election algorithm]], and the active master's address is stored in !ZooKeeper.
- 
- If the active HMaster dies, a new competition is started among the inactive "hot ready" HMasters, again using the [[http://hadoop.apache.org/zookeeper/docs/current/recipes.html#sc_leaderElection|leader election algorithm]].
- 
- HMaster duties:
- 
-  * Cluster initialization
-  * Assigning/unassigning regions to/from H!RegionServers (unassigning is for load balancing)
-  * Monitor the health and load of each H!RegionServer
-  * Changes to the table schema and handling table administrative functions
-  * Data localization
- 
- === Cluster initialization ===
- 
- On its first start, the master tries to read the root and root region directories from HDFS; if this fails, it creates them along with the first META region directory. On subsequent starts, the master reads back information about the cluster and its regions.
- 
- === Assigning regions to HRegionServers ===
- 
- Each region is either assigned to exactly one H!RegionServer or not yet assigned. The first region to be assigned is the ''ROOT region'', which locates all the META regions to be assigned. Each ''META region'' maps a number of user regions, which comprise the multiple tables that a particular HBase instance serves. Once all the META regions have been assigned, the master will then assign user regions to the H!RegionServers, attempting to balance the number of regions served by each H!RegionServer.
- 
- ==== Assigned region ====
- 
- If a region is assigned to some H!RegionServer, that server is responsible for serving it. While serving a region, the H!RegionServer handles read/write requests to that region, caches the latest (not yet flushed) modifications in the Memcache, and writes all changes to the WAL; persistent, reliable storage of regions is provided by HDFS.
- 
- ==== The META Table ====
- 
- The META table stores information about every user region in HBase. Each entry includes an H!RegionInfo object containing information such as the HRegion id, start and end keys, and a reference to the HRegion's table descriptor, as well as the address of the H!RegionServer that is currently serving the region. The META table can grow as the number of user regions grows.
- 
- ==== The ROOT Table ====
- 
- The ROOT table is confined to a single region and maps all the regions in the META table. Like the META table, it contains an H!RegionInfo object for each META region and the location of the H!RegionServer that is serving that META region.
- 
- Each row in the ROOT and META tables is approximately 1KB in size. At the default region size of 256MB, the ROOT region can map 2.6 x 10^5^ META regions, which in turn map a total of 6.9 x 10^10^ user regions, corresponding to approximately 1.8 x 10^19^ (roughly 2^64^) bytes of user data.
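- 
- A worked version of this arithmetic (a sketch; the 1KB row size and 256MB region size are the defaults assumed above):
- {{{
public class CapacityMath {
  public static void main(String[] args) {
    double rowSize = 1L << 10;      // ~1KB per ROOT/META row
    double regionSize = 256L << 20; // 256MB default region size
    double metaRegions = regionSize / rowSize;      // 2^18 ~= 2.6 x 10^5
    double userRegions = metaRegions * metaRegions; // 2^36 ~= 6.9 x 10^10
    double userBytes = userRegions * regionSize;    // 2^64 ~= 1.8 x 10^19
    System.out.printf("%.1e META regions, %.1e user regions, %.1e bytes%n",
        metaRegions, userRegions, userBytes);
  }
}
}}}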
- 
- Every server (master or region) can get ''ROOT region'' location from 
!ZooKeeper. 
- 
- === Monitor the health of each HRegionServer ===
- 
- If the HMaster detects that an H!RegionServer is no longer reachable, it will split the H!RegionServer's write-ahead log so that there is now one write-ahead log for each region that the H!RegionServer was serving. After it has accomplished this, it will reassign the regions that were being served by the unreachable H!RegionServer.
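- 
- A toy sketch of this log splitting (illustrative only; log entries and region names are simplified to Strings):
- {{{
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogSplitSketch {
  // Partition one shared write-ahead log into one log per region so that
  // each region can be replayed independently after reassignment.
  static Map<String, List<String>> splitLog(List<String> hlogEntries) {
    Map<String, List<String>> perRegion = new HashMap<>();
    for (String entry : hlogEntries) {
      String region = regionOf(entry);
      perRegion.computeIfAbsent(region, r -> new ArrayList<>()).add(entry);
    }
    return perRegion;
  }

  // Stand-in for reading the region name out of a real log entry.
  static String regionOf(String entry) {
    return entry.split(":")[0];
  }
}
}}}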
- 
- If the HMaster detects an overloaded or underloaded H!RegionServer, it will unassign (close) some regions from the most loaded H!RegionServers. Unassigned regions will then be assigned to less loaded servers.
- 
- === Changes to the table schema and handling table administrative functions 
===
- 
- A table schema is the set of tables and their column families. The HMaster can add and remove column families and enable or disable tables.
- 
- === Data localization ===
- 
- Clients can request the locations of regions so that they can read data directly from the region servers.
- 
- <<Anchor(hregionserv)>>
- == HRegionServer ==
- 
- H!RegionServer duties:
- 
-  * Serving HRegions assigned to H!RegionServer
-  * Handling client read and write requests
-  * Flushing cache to HDFS
-  * Keeping HLog
-  * Compactions
-  * Region Splits
- 
- === Serving HRegions assigned to HRegionServer ===
- 
- Each HRegion is served by only one H!RegionServer. When an H!RegionServer starts serving an HRegion, it reads the HLog and all !StoreFiles for that HRegion from HDFS. While serving HRegions, the H!RegionServer manages the persistent storage of all changes to HDFS.
- 
- === Handling client read and write requests ===
- 
- The client communicates with the HMaster to get a list of HRegions and the H!RegionServers serving them. The client then sends write/read requests directly to the H!RegionServers.
- 
- ==== Write Requests ====
- 
- When a write request is received, it is first written to a write-ahead log called an ''HLog''. All write requests for every region the region server is serving are written to the same ''HLog''. Once the request has been written to the ''HLog'', the resulting change is stored in an in-memory cache called the ''Memcache''. There is one Memcache for each Store.
- 
- ==== Read Requests ====
- 
- Reads are handled by first checking the Memcache and if the requested data is 
not found, the !StoreFiles are searched for results.
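- 
- The following toy model ties together the write path, the read path, and the cache flush described in the next section (plain Java; names mirror the prose, not actual HBase classes, and versioning is omitted):
- {{{
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StoreSketch {
  private final List<String> hlog = new ArrayList<>();          // write-ahead log
  private final Map<String, byte[]> memcache = new HashMap<>(); // in-memory cache
  private final List<Map<String, byte[]>> storeFiles = new ArrayList<>(); // newest first

  void write(String key, byte[] value) {
    hlog.add(key);            // 1. record the change in the HLog first
    memcache.put(key, value); // 2. then apply it to the Memcache
  }

  byte[] read(String key) {
    byte[] v = memcache.get(key); // the newest data lives in the Memcache
    if (v != null) return v;
    for (Map<String, byte[]> f : storeFiles) { // otherwise search the StoreFiles
      v = f.get(key);
      if (v != null) return v;
    }
    return null;
  }

  void flush() {
    // Memcache reached its limit: persist it as a new StoreFile and mark
    // the HLog so that replay can skip entries from before this flush.
    storeFiles.add(0, new HashMap<>(memcache));
    memcache.clear();
    hlog.add("<flush marker>");
  }
}
}}}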
- 
- === Cache Flushes ===
- 
- When the Memcache reaches a configurable size, it is flushed to HDFS, creating a new !StoreFile, and a marker is written to the HLog so that when the log is replayed, entries from before the last flush can be skipped. A flush may also be triggered to relieve memory pressure on the region server.
- 
- Cache flushes happen concurrently with the region server processing read and write requests. Just before the new !StoreFile is moved into place, reads and writes are suspended until the !StoreFile has been added to the list of active !StoreFiles for the HStore.
- 
- === Keeping HLog ===
- 
- There is only one ''HLog'' per H!RegionServer. It is the write-ahead log for all changes to every HRegion that the server is serving.
- 
- There are two processes that restrict the ''HLog'' size:
-  * Rolling process: when the current ''HLog'' file reaches a configurable size, the ''HLog'' starts writing to a new file and closes the old one.
-  * Flushing process: when the ''HLog'' reaches a configurable size, it is flushed to HDFS.
- 
- === Compactions ===
- 
- When the number of !StoreFiles exceeds a configurable threshold, a minor compaction is performed, which consolidates the most recently written !StoreFiles. A major compaction is performed periodically, which consolidates all the !StoreFiles into a single !StoreFile. The reason for not always performing a major compaction is that the oldest !StoreFile can be quite large, and reading and merging it with the latest !StoreFiles, which are much smaller, can be very time consuming due to the amount of I/O involved in reading, merging, and writing the contents of the largest !StoreFile.
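- 
- A toy sketch of this policy (illustrative only; a !StoreFile is modelled as a map and versioning is omitted):
- {{{
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy compaction policy; storeFiles is ordered newest first.
public class CompactionSketch {
  static void compact(List<Map<String, byte[]>> storeFiles, boolean major, int minorLimit) {
    // A minor compaction merges only the most recently written files;
    // a major compaction merges every file into a single StoreFile.
    int n = major ? storeFiles.size() : Math.min(minorLimit, storeFiles.size());
    Map<String, byte[]> merged = new HashMap<>();
    // Merge oldest-to-newest so newer cell values overwrite older ones.
    for (int i = n - 1; i >= 0; i--) {
      merged.putAll(storeFiles.get(i));
    }
    storeFiles.subList(0, n).clear(); // drop the merged inputs...
    storeFiles.add(0, merged);        // ...and install the new StoreFile
  }
}
}}}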
- 
- Compactions happen concurrently with the region server processing read and 
write requests. Just before the new !StoreFile is moved into place, reads and 
writes are suspended until the !StoreFile has been added to the list of active 
!StoreFiles for the HStore and the !StoreFiles that were merged to create the 
new !StoreFile have been removed.
- 
- === Region Splits ===
- 
- When the aggregate size of the !MapFiles for an HStore reaches a configurable 
size (currently 256MB), a region split is requested. Region splits divide the 
row range of the parent region in half and happen very quickly because the 
child regions read from the parent's !MapFile. 
- 
- The parent region is taken off-line, the region server records the new child 
regions in the META region and the master is informed that a split has taken 
place so that it can assign the children to region servers. Should the split 
message be lost, the master will discover the split has occurred since it 
periodically scans the META regions for unassigned regions.
- 
- Once the parent region is closed, read and write requests for the region are 
suspended. The client has a mechanism for detecting a region split and will 
wait and retry the request when the new children are on-line.
- 
- When a compaction is triggered in a child, the data from the parent is copied 
to the child. When both children have performed a compaction, the parent region 
is garbage collected.
- 
- <<Anchor(client)>>
- == HBase Client ==
- 
- HBase is a heavy-client system: each client manages its own connections to the appropriate servers.
- 
- The HBase client is responsible for finding H!RegionServers that are serving 
the particular row range of interest. On instantiation, the HBase client 
communicates with the H!BaseMaster to find the location of the ROOT region. 
This is the only communication between the client and the master.
- 
- Once the ROOT region is located, the client contacts that region server and 
scans the ROOT region to find the META region that will contain the location of 
the user region that contains the desired row range. It then contacts the 
region server that is serving that META region and scans that META region to 
determine the location of the user region.
- 
- After locating the user region, the client contacts the region server serving 
that region and issues the read or write request.
- 
- This information is cached in the client so that subsequent requests need not 
go through this process. 
- 
- Should a region be reassigned either by the master for load balancing or 
because a region server has died, the client will rescan the META table to 
determine the new location of the user region. If the META region has been 
reassigned, the client will rescan the ROOT region to determine the new 
location of the META region. If the ROOT region has been reassigned, the client 
will contact the master to determine the new ROOT region location and will 
locate the user region by repeating the original process described above.
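- 
- A toy model of this lookup-and-cache cycle (illustrative only; the RPC calls are stubbed out and all names are invented):
- {{{
import java.util.HashMap;
import java.util.Map;

public class LocatorSketch {
  private final Map<String, String> cache = new HashMap<>(); // row range -> server

  String locate(String row) {
    String server = cache.get(rangeFor(row));
    if (server != null) return server;      // fast path: cached location
    String rootServer = askMasterForRoot(); // the only master round-trip
    String metaServer = scanRootForMeta(rootServer, row); // ROOT -> META region
    server = scanMetaForUserRegion(metaServer, row);      // META -> user region
    cache.put(rangeFor(row), server);       // remember for later requests
    return server;
  }

  void onRegionMoved(String row) {
    cache.remove(rangeFor(row)); // stale entry: rescan META on the next access
  }

  // Stand-ins for the real lookups.
  private String rangeFor(String row) { return row; }
  private String askMasterForRoot() { return "root-server"; }
  private String scanRootForMeta(String root, String row) { return "meta-server"; }
  private String scanMetaForUserRegion(String meta, String row) { return "region-server"; }
}
}}}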
- 
- <<Anchor(impl)>>
- = Implementation =
- Details of the HBase implementation will appear here.
- 
