Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by JeanDanielCryans:
http://wiki.apache.org/hadoop/Hbase/DataModel

------------------------------------------------------------------------------
+ '''This page is a work in progress'''
+  
   * [#intro Introduction]
   * [#overview Overview]
+  * [#row Rows]
+  * [#columns Column Families]
+  * [#ts Timestamps]
+  * [#famatt Family Attributes]
+  * [#example Real Life Example]
+   * [#relational The Source ERD]
+   * [#hbaseschema The HBase Target Schema]
  
  [[Anchor(intro)]]
  = Introduction =
@@ -11, +20 @@

  [[Anchor(overview)]]
  = Overview =
  
- To put it simply, HBase can be reduced to a Map<byte[], Map<byte[], 
Map<byte[], Map<long, byte[]>>>>. The first Map maps row keys to their ''column 
families''. The second maps column families to their ''column keys''. The third 
one maps column keys'' to their ''timestamps''. Finally, the last one maps the 
timestamps to a single value. The keys are typically strings, the timestamp is 
a long and the value is an uninterpreted array of bytes. The 
+ To put it simply, HBase can be reduced to a Map<byte[], Map<byte[], 
Map<byte[], Map<long, byte[]>>>>. The first Map maps row keys to their ''column 
families''. The second maps column families to their ''column keys''. The third 
maps column keys to their ''timestamps''. Finally, the last one maps each 
timestamp to a single value. The keys are typically strings, the timestamp is 
a long and the value is an uninterpreted array of bytes. The column key is 
always preceded by its family and is written like this: ''family:key''. 
Since a family maps to another map, a single column family can contain a 
theoretically unlimited number of column keys. To retrieve a single value, 
the user therefore has to do a ''get'' using three keys:
  
  row key+column key+timestamp -> value
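
The nested-map view above can be sketched in plain Java. This is only an illustration of the model, not the HBase API: the class and method names are hypothetical, and String keys stand in for the byte[] keys so that the standard sorted maps can be used without a custom comparator.

```java
import java.util.TreeMap;

// Conceptual sketch of the data model only. Nothing here is part of the
// HBase API, and String keys are an assumption standing in for byte[].
class NestedMapSketch {
    // row key -> column family -> column key -> timestamp -> value
    private final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>>> rows =
            new TreeMap<>();

    // Stores one value, creating the intermediate maps on demand.
    void put(String row, String family, String columnKey, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(columnKey, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // row key + column key (family and key) + timestamp -> value,
    // or null when any level of the lookup is missing.
    byte[] get(String row, String family, String columnKey, long timestamp) {
        TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> families = rows.get(row);
        if (families == null) return null;
        TreeMap<String, TreeMap<Long, byte[]>> columns = families.get(family);
        if (columns == null) return null;
        TreeMap<Long, byte[]> versions = columns.get(columnKey);
        return versions == null ? null : versions.get(timestamp);
    }

    public static void main(String[] args) {
        NestedMapSketch table = new NestedMapSketch();
        table.put("20080702", "info", "aaa", 1L, "hello".getBytes());
        System.out.println(new String(table.get("20080702", "info", "aaa", 1L)));
    }
}
```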
  
+ [[Anchor(row)]]
+ = Rows =
+ 
+ The row key is treated by HBase as an array of bytes but it must have a 
string representation. A special property of the row key Map is that it keeps 
its keys in lexicographical order. For example, the numbers from 1 to 100 are 
ordered like this:
+ 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,...,9,91,92,93,94,95,96,97,98,99
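
The ordering above can be reproduced with any sorted collection of strings. The following is a small sketch using the Java standard library rather than HBase itself; the class and method names are hypothetical. For ASCII digits, Java's natural String ordering matches the byte-wise ordering described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a TreeSet of strings sorts decimal numbers the same
// way the row key Map does. Not part of the HBase API.
class RowKeyOrderSketch {
    // Returns the numbers from..to (inclusive) as strings, sorted lexicographically.
    static List<String> sortedKeys(int from, int to) {
        TreeSet<String> keys = new TreeSet<>();
        for (int i = from; i <= to; i++) {
            keys.add(Integer.toString(i));
        }
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        System.out.println(sortedKeys(1, 100));
    }
}
```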
+ 
+ To keep the integers' natural ordering, the row keys have to be left-padded 
with zeros. To take advantage of this ordering, the row key Map also offers a 
scanner which takes a ''start row key'' (if not specified, the first one in 
the table) and a ''stop row key'' (if not specified, the last one in the 
table). For example, if the row keys are dates in the format YYYYMMDD, getting 
the month of July 2008 is a matter of opening a scanner from ''20080700'' to 
''20080800''. It does not matter whether the specified row keys exist. The 
only thing to keep in mind is that the stop row key itself is never returned, 
which is why a key that sorts after the last day of July is given to the 
scanner. 
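
A rough sketch of both ideas, zero-padding and a scan over a half-open key range, using a TreeMap in place of a table. The helper names pad and scan are hypothetical and this is not the HBase scanner API; it only mirrors the behaviour described above, where the start key is included and the stop key is excluded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: a sorted map stands in for an HBase table.
class RowScanSketch {
    // Left-pads a non-negative number with zeros so that string order
    // matches numeric order.
    static String pad(int n, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", n);
    }

    // Returns the row keys in [startRow, stopRow): start included, stop excluded.
    // Neither bound has to exist in the table.
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (String day : new String[] {"20080630", "20080701", "20080715", "20080731", "20080801"}) {
            table.put(day, "row for " + day);
        }
        // Every key of July 2008, even though neither bound is an existing row.
        System.out.println(scan(table, "20080700", "20080800"));
        System.out.println(pad(7, 3));
    }
}
```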
+ 
+ [[Anchor(columns)]]
+ = Column Families =
+ 
+ A column family groups data of the same nature in HBase and puts no 
constraint on its type. The families are part of the table schema and stay the 
same for each row; what differs from row to row is the set of column keys, 
which can be very sparse. For example, row "20080702" may have in its "info:" 
family the following column keys:
+ ||info:aaa||
+ ||info:bbb||
+ ||info:ccc||
+ While row "20080703" only has:
+ ||info:12342||
+ Developers have to be very careful when using column keys, since a key with 
a length of zero is permitted. In the previous example, this means that data 
can be inserted in column key "info:". We strongly suggest using empty column 
keys only when no other keys will be specified. Also, since the data in a 
family has the same nature, many attributes can be specified regarding 
[#famatt performance and timestamps].
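
To make the ''family:key'' notation and the empty-key case concrete, here is a minimal sketch that splits a column name into its two parts. The class and method are hypothetical, and splitting at the first colon is an assumption of this sketch, not something the page specifies.

```java
// Illustration only: splits "family:key" into { family, key }.
// Splitting at the FIRST colon is an assumption made for this sketch.
class ColumnNameSketch {
    static String[] split(String column) {
        int colon = column.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected family:key, got " + column);
        }
        return new String[] { column.substring(0, colon), column.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String[] named = split("info:aaa");
        String[] empty = split("info:");
        System.out.println(named[0] + " / '" + named[1] + "'");
        // The key part is the empty string here, which is a permitted column key.
        System.out.println(empty[0] + " / '" + empty[1] + "'");
    }
}
```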
+ 
+ [[Anchor(ts)]]
+ = Timestamps =
+ 
+ The values in HBase may have multiple versions kept according to the family 
configuration. By default, HBase sets the timestamp of each new value to the 
current time in milliseconds and returns the latest version when a cell is 
retrieved. Developers can also provide their own timestamps when inserting 
data, and can specify a particular timestamp when fetching it.
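
The versioning behaviour can be sketched for a single cell with a map sorted by timestamp. The names are hypothetical and this is not the HBase API. In particular, the exact-match lookup in at() is an assumption taken from the timestamp -> value map in the overview; the page does not say how a fetch with a timestamp treats missing versions.

```java
import java.util.TreeMap;

// Illustration only: the versions of one cell, keyed by timestamp.
class CellVersionsSketch {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // Default behaviour: stamp the value with the current time in milliseconds.
    void put(String value) {
        put(System.currentTimeMillis(), value);
    }

    // The caller may also supply its own timestamp.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Latest version, or null if the cell is empty.
    String latest() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    // Version stored at exactly this timestamp, or null (exact match is an assumption).
    String at(long timestamp) {
        return versions.get(timestamp);
    }

    public static void main(String[] args) {
        CellVersionsSketch cell = new CellVersionsSketch();
        cell.put(100L, "old");
        cell.put(200L, "new");
        System.out.println(cell.latest());
        System.out.println(cell.at(100L));
    }
}
```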
+ 
+ [[Anchor(famatt)]]
+ = Family Attributes =
+ 
+ The following attributes can be specified for each family:
+ 
+ Implemented
+ 
+  * Compression
+   * Record: each value found at a rowkey+columnkey+timestamp is compressed 
independently.
+   * Block: blocks in HDFS are compressed. A block may contain multiple 
records if they are shorter than one HDFS block, or only part of a record if 
the record is longer than an HDFS block.
+  * Timestamps
+   * Max number: the maximum number of versions kept for a value.
+   * Time to live: versions older than the specified time will be garbage 
collected.
+ 
+ Not yet implemented
+ 
+  * In memory: all values of that family will be kept in memory.
+  * Length: values written will not be longer than the specified number of 
bytes.
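
The two timestamp attributes listed under "Implemented" (max number and time to live) can be pictured as a pruning pass over the versions of a cell. This is a hedged sketch with hypothetical names. It says nothing about when or how HBase actually performs the cleanup, and the order of the two steps is an arbitrary choice.

```java
import java.util.TreeMap;

// Illustration only: applies "time to live" and "max number" to the
// versions of one cell, keyed by timestamp in milliseconds.
class FamilyAttributesSketch {
    static void prune(TreeMap<Long, String> versions, int maxVersions, long ttlMillis, long now) {
        // Time to live: drop every version strictly older than now - ttl.
        versions.headMap(now - ttlMillis, false).clear();
        // Max number: keep only the newest maxVersions entries.
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry();
        }
    }

    public static void main(String[] args) {
        TreeMap<Long, String> versions = new TreeMap<>();
        versions.put(100L, "a");
        versions.put(200L, "b");
        versions.put(300L, "c");
        versions.put(400L, "d");
        prune(versions, 3, 250L, 500L);
        System.out.println(versions.keySet());
    }
}
```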
+ 
+ [[Anchor(example)]]
+ = Real Life Example =
+ 
+ [[Anchor(relational)]]
+ == The Source ERD ==
+ 
+ [[Anchor(hbaseschema)]]
+ == The HBase Target Schema ==
+ 
