Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by JeanDanielCryans:
http://wiki.apache.org/hadoop/Hbase/DataModel

------------------------------------------------------------------------------
+ '''This page is a work in progress'''
+ 
   * [#intro Introduction]
   * [#overview Overview]
+  * [#row Rows]
+  * [#columns Column Families]
+  * [#ts Timestamps]
+  * [#famatt Family Attributes]
+  * [#example Real Life Example]
+   * [#relational The Source ERD]
+   * [#hbaseschema The HBase Target Schema]
  
  [[Anchor(intro)]]
  = Introduction =
@@ -11, +20 @@
  [[Anchor(overview)]]
  = Overview =
  
- To put it simply, HBase can be reduced to a Map<byte[], Map<byte[], Map<byte[], Map<long, byte[]>>>>. The first Map maps row keys to their ''column families''. The second maps column families to their ''column keys''. The third one maps column keys'' to their ''timestamps''. Finally, the last one maps the timestamps to a single value. The keys are typically strings, the timestamp is a long and the value is an uninterpreted array of bytes. The
+ To put it simply, HBase can be reduced to a Map<byte[], Map<byte[], Map<byte[], Map<long, byte[]>>>>. The first Map maps row keys to their ''column families''. The second maps column families to their ''column keys''. The third one maps column keys to their ''timestamps''. Finally, the last one maps the timestamps to a single value. The keys are typically strings, the timestamp is a long and the value is an uninterpreted array of bytes. The column key is always preceded by its family and is written like this: ''family:key''. Since a family maps to another map, a single column family can contain a theoretically unlimited number of column keys. So, to retrieve a single value, the user has to do a ''get'' using three keys: row key+column key+timestamp -> value.
+ 
+ [[Anchor(row)]]
+ = Rows =
+ 
+ The row key is treated by HBase as an array of bytes but it must have a string representation.
+ A special property of the row key Map is that it keeps its keys in lexicographical order. For example, numbers going from 1 to 100 will be ordered like this:
+ 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,...,9,91,92,93,94,95,96,97,98,99
+ 
+ To keep the integers' natural ordering, the row keys have to be left-padded with zeros. To take advantage of this ordering, the row key Map also offers a scanner which takes a ''start row key'' (if not specified, the first one in the table) and a ''stop row key'' (if not specified, the scan runs through the last row of the table). For example, if the row keys are dates in the format YYYYMMDD, getting the month of July 2008 is a matter of opening a scanner from ''20080700'' to ''20080800''. It does not matter whether the specified row keys exist. The only thing to keep in mind is that the stop row key itself is never returned, which is why the stop key given to the scanner is ''20080800'', a key that sorts after every day of July and before the first of August.
+ 
+ [[Anchor(columns)]]
+ = Column Families =
+ 
+ A column family groups data of the same nature in HBase and places no constraint on its type. The families are part of the table schema and stay the same for each row; what differs from row to row is the set of column keys, which can be very sparse. For example, row "20080702" may have in its "info:" family the following column keys:
+ ||info:aaa||
+ ||info:bbb||
+ ||info:ccc||
+ While row "20080703" only has:
+ ||info:12342||
+ Developers have to be very careful when using column keys, since a key with a length of zero is permitted, which means that in the previous example data can be inserted in column key "info:". We strongly suggest using empty column keys only when no other keys will be specified. Also, since the data in a family has the same nature, many attributes can be specified regarding [#famatt performance and timestamps].
+ 
+ [[Anchor(ts)]]
+ = Timestamps =
+ 
+ The values in HBase may have multiple versions kept according to the family configuration.
+ By default, HBase sets the timestamp of each new value to the current time in milliseconds and returns the latest version when a cell is retrieved. The developer can also provide their own timestamps when inserting data, and can specify a particular timestamp when fetching it.
+ 
+ [[Anchor(famatt)]]
+ = Family Attributes =
+ 
+ The following attributes can be specified for each family:
+ 
+ Implemented
+ 
+  * Compression
+   * Record: each value found at a rowkey+columnkey+timestamp is compressed independently.
+   * Block: blocks in HDFS are compressed. A block may contain multiple records if they are shorter than one HDFS block, or only part of a record if the record is longer than an HDFS block.
+  * Timestamps
+   * Max number: the maximum number of versions kept for a value.
+   * Time to live: versions older than the specified time will be garbage collected.
+ 
+ Still not implemented
+ 
+  * In memory: all values of that family will be kept in memory.
+  * Length: values written will not be longer than the specified number of bytes.
+ 
+ [[Anchor(example)]]
+ = Real Life Example =
+ 
+ [[Anchor(relational)]]
+ == The Source ERD ==
+ 
+ [[Anchor(hbaseschema)]]
+ == The HBase Target Schema ==
+ 
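
Referring back to the Overview and Timestamps sections, the nested-map view of a table can be sketched in plain Java. This is only a toy in-memory model under stated assumptions, not the HBase client API: the class `NestedMapSketch` and its `put`, `get` and `getLatest` methods are hypothetical names, and `String` keys stand in for the `byte[]` keys because Java arrays do not define an ordering.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of the nested maps; NOT the HBase client API.
// Assumption: String keys stand in for the byte[] keys of the real system.
class NestedMapSketch {
    // row key -> column family -> column key -> timestamp -> value
    private final NavigableMap<String, Map<String, Map<String, NavigableMap<Long, byte[]>>>> rows =
            new TreeMap<>();

    // Stores one version of a cell under row key + family:key + timestamp.
    public void put(String row, String family, String key, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(key, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // All stored versions of a cell, or null if the cell does not exist.
    private NavigableMap<Long, byte[]> versions(String row, String family, String key) {
        Map<String, Map<String, NavigableMap<Long, byte[]>>> families = rows.get(row);
        if (families == null) {
            return null;
        }
        Map<String, NavigableMap<Long, byte[]>> columns = families.get(family);
        return columns == null ? null : columns.get(key);
    }

    // Returns the value stored at exactly this timestamp, or null.
    public byte[] get(String row, String family, String key, long timestamp) {
        NavigableMap<Long, byte[]> v = versions(row, family, key);
        return v == null ? null : v.get(timestamp);
    }

    // Returns the newest version (highest timestamp), or null.
    public byte[] getLatest(String row, String family, String key) {
        NavigableMap<Long, byte[]> v = versions(row, family, key);
        return v == null ? null : v.lastEntry().getValue();
    }

    public static void main(String[] args) {
        NestedMapSketch table = new NestedMapSketch();
        table.put("20080702", "info", "aaa", 1L, "old".getBytes(StandardCharsets.UTF_8));
        table.put("20080702", "info", "aaa", 2L, "new".getBytes(StandardCharsets.UTF_8));
        // Latest version wins when no timestamp is given: prints "new"
        System.out.println(new String(table.getLatest("20080702", "info", "aaa"), StandardCharsets.UTF_8));
        // An explicit timestamp selects an older version: prints "old"
        System.out.println(new String(table.get("20080702", "info", "aaa", 1L), StandardCharsets.UTF_8));
    }
}
```

The sketch keeps every version it is given; the "Max number" and "Time to live" family attributes described above are deliberately left out.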

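
The row ordering and scanner behaviour described in the Rows section can be illustrated with a sorted set of strings. This is a minimal sketch, assuming ASCII row keys so that Java's `String` ordering matches byte-wise lexicographical order; `RowKeyOrderSketch`, `pad` and `scan` are hypothetical names for illustration and are not part of HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrates lexicographical row-key order and a start/stop scan.
// Assumption: ASCII keys, so String order matches byte-wise order.
class RowKeyOrderSketch {
    // Left-pads a non-negative number with zeros to a fixed width.
    public static String pad(int n, int width) {
        StringBuilder sb = new StringBuilder(Integer.toString(n));
        while (sb.length() < width) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    // Returns the keys in [startRow, stopRow): start inclusive, stop exclusive.
    public static List<String> scan(TreeSet<String> keys, String startRow, String stopRow) {
        return new ArrayList<>(keys.subSet(startRow, true, stopRow, false));
    }

    public static void main(String[] args) {
        TreeSet<String> plain = new TreeSet<>();
        TreeSet<String> padded = new TreeSet<>();
        for (int i = 1; i <= 100; i++) {
            plain.add(Integer.toString(i));
            padded.add(pad(i, 3));
        }
        // Unpadded keys sort as text: prints [1, 10, 100, 11]
        System.out.println(new ArrayList<>(plain).subList(0, 4));
        // Zero-padded keys keep the numeric order: prints [001, 002, 003, 004]
        System.out.println(new ArrayList<>(padded).subList(0, 4));

        TreeSet<String> dates = new TreeSet<>(
                Arrays.asList("20080630", "20080701", "20080715", "20080731", "20080801"));
        // July 2008 only; neither boundary key needs to exist.
        // prints [20080701, 20080715, 20080731]
        System.out.println(scan(dates, "20080700", "20080800"));
    }
}
```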