[Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman

Apache Wiki Mon, 12 Feb 2007 19:45:14 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for 
change notification.


The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
  or '''column''' for short, the latter is called a ''map column'' or
  '''map''' for short.
  
+ There are two types of ''column'':
+  1. value is arbitrary sequence of bytes
+  1. value is an integer counter. We refer to this type of column as a 
'''counter'''
+ 
  Google makes no distinction between these two value types and groups
  them under the term ''column family''. They achieve the single valued
  column as a degenerate case of a column family. A single valued column
@@ -59, +63 @@

  application desire to implement a ''locality group'' it can do so by
  simply restricting its map column key set.
  
- We use the terms '''column''' and '''map''' throughout the rest of the 
document for consistency.
+ We use the terms '''column''', '''map''' and '''counter''' throughout the 
rest of the document for consistency.
  
  [[Anchor(conceptual)]]
  == Conceptual View ==
@@ -136, +140 @@

  lock service. The following depicts the layout of the file space:
  
  {{{
- /hbase/table-name1/columnname1
+ /hbase/instance-name/table-name1/columnname1
-                    columnname2
+                                  columnname2
-                    etc.
+                                  etc.
  
-        table-name2/columnname1
+                      table-name2/columnname1
-                    columnname2
+                                  columnname2
-                    etc.
+                                  etc.
  }}}
  
  Each column file contains configuration settings for the column:
-  * whether it is a ''column'' or a ''map''
+  * whether it is a ''column'', ''map'', or ''counter''
-   * if ''column'', whether the value is an integer counter
+   * if ''map'', map compression setting
   * tablet size
   * tablet block size
   * garbage collection policy
@@ -155, +159 @@

    * only keep data less than n days old
   * in memory setting
   * bloom filter enabled/disabled
+  * block compression
  
  Access control for each column is based on the access control setting
  for the column file.
  
  Each Hbase table is served by a number of server processes which
- comprise an Hbase '''instance'''. When a server process is started, it
+ belong to an Hbase '''instance'''. When a server process is started, it
  is told which instance to join.
  
  Since the master server and tablet servers run substantially different
@@ -176, +181 @@

  Thus the run time information looks like:
  
  {{{
- /hbase/table-name1/root-tablet
+ /hbase/instance-name/table-name1/root-tablet
-                    master.lock
+                                  master.lock
-                    servers/
+                                  servers/
-                            tablet-server-1
+                                  tablet-server-1
-                            tablet-server-2
+                                  tablet-server-2
-                            etc.
+                                  etc.
  
-        table-name2/root-tablet
+        instance-name/table-name2/root-tablet
-                    master.lock
+                                  master.lock
-                    servers/
+                                  servers/
-                            tablet-server-1
+                                  tablet-server-1
-                            tablet-server-2
+                                  tablet-server-2
-                            etc.
+                                  etc.
  }}}
  
  and the complete distributed lock server file space for a single table
@@ -222, +227 @@

  = Master Node =
  
   * Handles table creation and deletion
+   * bootstrap
+    * create instance directory in lock server file system
+    * create servers directory in lock server file system
+    * create root tablet in HDFS
+    * save location in lock server file system
+    * create first metadata tablet
+    * write metadata tablet information to root tablet
+   * column/map creation
+    * create file in lock server file system
+    * write configuration settings to file
+    * create new tablet file in HDFS
+    * write tablet information to METADATA table
+    * assign tablet to tablet server
   * Responsible for assigning tablets to tablet servers
   * Detects the addition and expiration of tablet servers
   * Balances tablet server load
@@ -331, +349 @@

   * Can be Memory-mapped
   * Can be shared by two tablets immediately after a split
   * API
-   * Lookup a given key ''(with or without timestamp?)''
+   * Lookup a given key ''(with or without timestamp)''
    * Iterate over key/value pairs
  
  As a first approximation, a Hadoop !MapFile satisfies these requirements. It 
is a persistent (lives in the Hadoop DFS), ordered (!MapFile is based on 
!SequenceFile which is strictly ordered), immutable (once written, an attempt 
to open a !MapFile for writing will overwrite the existing contents) map from 
keys to values.
@@ -360, +378 @@

    * Root tablet stores the location of all tablets in a METADATA table
    * The "root tablet" is just the first tablet in the METADATA table, it is 
never split
  
-   ''I really don't like this definition (and I know it is from the Bigtable 
paper). Aside from the fact that it is never split, the root tablet is special 
in another way: it is the metadata table for the METADATA table.''
+   ''I really don't like this definition (and I know it is from the Bigtable 
paper). Aside from the fact that it is never split, the root tablet is special 
in another way: it is the metadata table for the METADATA table.'' ''It 
'''does''' however have the same format as the rest of the METADATA table.''
  
-   '''Example:''' [[Anchor(example)]]
+  * The row key is a combination of the table name, column or map name and the 
__'''first row key'''__ of the tablet.
  
+   ''Note that this differs from Bigtable which stores the location of a 
tablet under a row key that is an encoding of the tablet's table ID and it's 
__'''end row'''__''. ''The reasons for this are:''
-   ''Suppose you had a table with one column and 4x10^9^ rows. Each row 
contains about 5KB of data resulting in a total table size of 200TB.''
- 
-   ''If each tablet holds about 100MB of data, this will require 2x10^6^ 
tablets and the same number of rows in the METADATA table.''
- 
-   ''If each metadata row is about 1KB, the METADATA table size required to 
map all the tablets is 2x10^9^ bytes. If each METADATA tablet is 100MB that 
requires 20 METADATA tablets to map the entire table, and consequently 20 root 
tablet rows to map the METADATA tablets.''
- 
-  * Stores the location of a tablet under a row key that is an encoding of the 
tablet's table ID and it's __'''end row'''__
-   * This seems counter intuitive for a couple of reasons:
-    1. Since a row key can be an arbitrary string, how do you represent the 
"maximum value" for the last tablet?
+    1. ''Since a row key can be an arbitrary string, how do you represent the 
"maximum value" for the last tablet?''
-    1. As new rows are added with row keys that are greater than the previous 
maximum value, is the metadata table only updated when the last tablet does a 
major compaction?
+    1. ''As new rows are added with row keys that are greater than the 
previous maximum value, the metadata table needs to be updated more frequently 
than if the first row key is stored''
+    1. ''If the first key in a tablet is stored instead, the minimum value 
could be represented as the empty string "" and a simple string comparison 
between a key and the first key would result in key > first key.'' If you 
represented the maximum row key as the empty string, that would require a 
special case instead of just a simple compare.''
  
-    If the first key in a tablet were stored instead, the minimum value could 
be represented as the empty string "" and a simple string comparison between a 
key and the first key would result in key > first key.
- 
-    If the keys go from 1 to n, when the tablet is split, the first tablet 
still has the row key "" as its first key and the second tablet has a first row 
key of n/2. So if a key is presented and its value is < n/2, you know that if 
it exists, it is in the first tablet and if the value >= n/2 it is in the 
second tablet.
+    1. ''If the keys go from 1 to n, when the tablet is split, the first 
tablet still has the row key "" as its first key and the second tablet has a 
first row key of n/2. So if a key is presented and its value is < n/2, you know 
that if it exists, it is in the first tablet and if the value >= n/2 it is in 
the second tablet.''
- 
-    I suppose you could represent the maximum row key as the empty string but 
that would require a special case instead of just a simple compare.
  
   * The "location" map has the ''InMemory'' tuning parameter set
   * Each row stores approximately 1KB of data in memory
   * All events pertaining to each tablet are logged here (such as when a 
tablet server starts serving a tablet)
+ 
+ === Format ===
+ 
+ The METADATA table has a single '''map''' column that stores the following 
data about the tablet it refers to:
+  * tablet file name
+  * whether the tablet contains a ''column'', ''map'' or ''counter''
+   * if ''map'', compression setting
+  * maximum desired tablet size
+  * tablet block size
+  * block compression setting
+  * garbage collection policy
+  * whether the tablet should be memory resident
+  * whether there is a bloom filter enabled for the tablet
+  * log redo point
+  * when the tablet most recently came on-line
+  * names and ordering of ''sub-tablets'' created as the result of a minor or 
merging compaction
+ 
+ [[Anchor(example)]]
+ === Example ===
+ 
+ ''Suppose you had a table with one column and 4x10^9^ rows. Each row contains 
about 5KB of data resulting in a total table size of 200TB.''
+ 
+ ''If each tablet holds about 100MB of data, this will require 2x10^6^ tablets 
and the same number of rows in the METADATA table.''
+ 
+ ''If each metadata row is about 1KB, the METADATA table size required to map 
all the tablets is 2x10^9^ bytes. If each METADATA tablet is 100MB that 
requires 20 METADATA tablets to map the entire table, and consequently 20 root 
tablet rows to map the METADATA tablets.''
  
  [[Anchor(ipc)]]
  = Inter-process Communication Messages =
@@ -405, +438 @@

  ||<:> byte ||<(> protocol version number ||
  ||<:> int ||<(> message type (op code) ||
  ||<:> int ||<(> number of parameters ||
- ||||<:> ''repeat the following for each parameter'' ||
+ ||||<:>  ''repeat the following for each parameter''  ||
  ||<:> int ||<(> parameter identifier ||
  ||<:> int ||<(> length of parameter value ||
  ||<:> byte[length] ||<(> parameter value ||
  
  [[Anchor(ipcclientmaster)]]
  == Client to Master Messages ==
+ 
+  * create table (table name)
+  * create column (type: {simple|counter|map}, max-tablet-size, 
tablet-block-size, gc-policy, compression-policy, in-memory, bloom-filter)
+  * get column metadata returns values that were specified for create column
+  * set acl (column-name, read, write, control)
+  * get acl (column-name) returns (read, write, control)
+  * get tablet server(tablet) returns server
  
  [[Anchor(ipcclienttablet)]]
  == Client to TabletServer Messages ==

[Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman

Reply via email to