Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/HBaseIntegration" page has been changed by JohnSichi. http://wiki.apache.org/hadoop/Hive/HBaseIntegration?action=diff&rev1=10&rev2=11 -------------------------------------------------- within HBase. If not specified, then the Hive and HBase table names will be identical. - After executing the command above, you should be able to see the new table in the HBase shell: + After executing the command above, you should be able to see the new (empty) table in the HBase shell: {{{ $ hbase shell @@ -64, +64 @@ hbase(main):001:0> list xyz 1 row(s) in 0.0530 seconds - hbase(main):002:0> describe 'xyz' + hbase(main):002:0> describe "xyz" DESCRIPTION ENABLED {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VE true RSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} 1 row(s) in 0.0220 seconds + hbase(main):003:0> scan "xyz" + ROW COLUMN+CELL + 0 row(s) in 0.0060 seconds }}} + Notice that even though a column name "val" is specified in the mapping, only the column family name "cf1" appears in the DESCRIBE output in the HBase shell. This is because in HBase, only column families (not columns) are known in the table-level metadata; column names within a column family are only present at the per-row level. + + Here's how to move data from Hive into the HBase table: + + {{{ + INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98; + }}} + + Use HBase shell to verify that the data actually got loaded: + + {{{ + hbase(main):009:0> scan "xyz" + ROW COLUMN+CELL + 98 column=cf1:val, timestamp=1267737987733, value=val_98 + 1 row(s) in 0.0110 seconds + }}} + + And then query it back via Hive: + + {{{ + hive> select * from hbase_table_1; + Total MapReduce jobs = 1 + Launching Job 1 out of 1 + ... 
+ OK + 98 val_98 + Time taken: 4.582 seconds + }}} + - If instead you want to give Hive access to an existing HBase table, + If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE: {{{ @@ -93, +125 @@ The column mapping support currently available is somewhat cumbersome and restrictive: - * the first column in the Hive table automatically becomes the key in the HBase table + * the first column in the Hive table always maps to the key in the HBase table - * for each subsequent Hive column, the table creator must specify a corresponding entry in the comma-delimited {{{hbase.columns.mapping}}} string (so for a Hive table with n columns, the string should have n-1 entries) + * for each subsequent Hive column, the table creator must specify a corresponding entry in the comma-delimited {{{hbase.columns.mapping}}} string (so for a Hive table with n columns, the string should have n-1 entries); whitespace should '''not''' be used between entries, since it will be interpreted as part of the column name, which is almost certainly not what you want * a mapping entry is of the form {{{column-family-name:[column-name]}}} * if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns + * there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp + * since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column - TBD: details on how primitive and map values are serialized + The next few sections provide detailed examples of the kinds of column mappings currently possible.
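+ To make the whitespace restriction concrete, here's a hedged sketch (the table and column names are illustrative, not taken from the examples above) of a mapping string written correctly, with no spaces between the comma-delimited entries:
+ 
+ {{{
+ CREATE TABLE hbase_table_2(key int, value1 string, value2 string)
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+   "hbase.columns.mapping" = "cf1:val1,cf1:val2"
+ );
+ }}}
+ 
+ Writing the string as {{{"cf1:val1, cf1:val2"}}} instead would cause the handler to treat {{{ val2}}} (with a leading space) as the HBase column name within family {{{cf1}}}.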
+ + == Multiple Columns and Families == + + Here's an example with three Hive columns and two HBase column + families, with two of the Hive columns ({{{value1}}} and {{{value2}}}) + corresponding to one of the column families ({{{a}}}, with HBase + column names {{{b}}} and {{{c}}}), and the other Hive column + corresponding to a single column ({{{e}}}) in its own column family + ({{{d}}}). + + {{{ + CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "a:b,a:c,d:e" + ); + INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2 + FROM pokes WHERE foo=98 OR foo=100; + }}} + + Here's how this looks in HBase: + + {{{ + hbase(main):014:0> describe "hbase_table_1" + DESCRIPTION ENABLED + {NAME => 'hbase_table_1', FAMILIES => [{NAME => 'a', COMPRESSION => 'N true + ONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_M + EMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'd', COMPRESSION => + 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN + _MEMORY => 'false', BLOCKCACHE => 'true'}]} + 1 row(s) in 0.0170 seconds + hbase(main):015:0> scan "hbase_table_1" + ROW COLUMN+CELL + 100 column=a:b, timestamp=1267740457648, value=val_100 + 100 column=a:c, timestamp=1267740457648, value=101 + 100 column=d:e, timestamp=1267740457648, value=102 + 98 column=a:b, timestamp=1267740457648, value=val_98 + 98 column=a:c, timestamp=1267740457648, value=99 + 98 column=d:e, timestamp=1267740457648, value=100 + 2 row(s) in 0.0240 seconds + }}} + + And when queried back into Hive: + + {{{ + hive> select * from hbase_table_1; + Total MapReduce jobs = 1 + Launching Job 1 out of 1 + ... + OK + 100 val_100 101 102 + 98 val_98 99 100 + Time taken: 4.054 seconds + }}} + + == Hive MAP to HBase Column Family == + + Here's how a Hive MAP datatype can be used to access an entire column + family. 
Each row can have a different set of columns, where the + column names correspond to the map keys and the column values + correspond to the map values. + + {{{ + CREATE TABLE hbase_table_1(key int, value map<string,int>) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "cf:" + ); + INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, map(bar, foo) FROM pokes + WHERE foo=98 OR foo=100; + }}} + + Here's how this looks in HBase (with different column names in different rows): + + {{{ + hbase(main):012:0> scan "hbase_table_1" + ROW COLUMN+CELL + 100 column=cf:val_100, timestamp=1267739509194, value=100 + 98 column=cf:val_98, timestamp=1267739509194, value=98 + 2 row(s) in 0.0080 seconds + }}} + + And when queried back into Hive: + + {{{ + hive> select * from hbase_table_1; + Total MapReduce jobs = 1 + Launching Job 1 out of 1 + ... + OK + 100 {"val_100":100} + 98 {"val_98":98} + Time taken: 3.808 seconds + }}} + + Note that the key of the MAP must have datatype string, since it is + used for naming the HBase column, so the following table definition will fail: + + {{{ + CREATE TABLE hbase_table_1(key int, value map<int,int>) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "cf:" + ); + }}} + + == Illegal: Hive Primitive to HBase Column Family == + + Table definitions such as the following should be avoided because a + Hive column mapped to a column family must have MAP type: + + {{{ + CREATE TABLE hbase_table_1(key int, value string) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "cf:" + ); + }}} + + Currently, the CREATE TABLE will succeed, but attempts to insert data + will fail with this internal error: + + {{{ + java.lang.RuntimeException: org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyStringObjectInspector cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector + }}} + + An improvement would be to catch this at CREATE TABLE time and reject + it as invalid. = Potential Followups = + + There are a number of areas where Hive/HBase integration could definitely use more love: * more flexible column mapping (HIVE-806) * default column mapping in cases where no mapping spec is given * filter/projection pushdown - * implement virtual partitions corresponding to HBase timestamps + * expose timestamp attribute, possibly also with support for treating it as a partition key * allow per-table hbase.master configuration * run profiler and minimize any per-row overhead in column mapping * user-defined routines for lookups and data loads via HBase client API (HIVE-758 and HIVE-791) - * support a fast-path mode in which no map/reduce is used for simple queries (go through HBase client API instead?) + * support a fast-path mode in which no map/reduce is used for simple queries and loads (go through HBase client API instead) * logging is very noisy, with a lot of spurious exceptions; investigate these and either fix their cause or squelch them = Build = @@ -133, +297 @@ Positive QL tests are under {{{hbase-handler/src/test/queries}}}. These use an HBase+ZooKeeper mini-cluster for hosting the fixture tables in-process, so no free-standing HBase installation is needed in order to run them. - Run them like this: + The QL tests can be executed via ant like this: {{{ ant test -Dtestcase=TestHBaseCliDriver -Dqfile=hbase_queries.q
