Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/HBaseIntegration" page has been changed by JohnSichi. http://wiki.apache.org/hadoop/Hive/HBaseIntegration?action=diff&rev1=10&rev2=11 -------------------------------------------------- within HBase. If not specified, then the Hive and HBase table names will be identical. - After executing the command above, you should be able to see the new table in the HBase shell: + After executing the command above, you should be able to see the new (empty) table in the HBase shell: {{{ $ hbase shell @@ -64, +64 @@ hbase(main):001:0> list xyz 1 row(s) in 0.0530 seconds - hbase(main):002:0> describe 'xyz' + hbase(main):002:0> describe "xyz" DESCRIPTION ENABLED {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VE true RSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} 1 row(s) in 0.0220 seconds + hbase(main):003:0> scan "xyz" + ROW COLUMN+CELL + 0 row(s) in 0.0060 seconds }}} + Notice that even though a column name "val" is specified in the mapping, only the column family name "cf1" appears in the DESCRIBE output in the HBase shell. This is because in HBase, only column families (not columns) are known in the table-level metadata; column names within a column family are only present at the per-row level. + + Here's how to move data from Hive into the HBase table: + + {{{ + INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98; + }}} + + Use HBase shell to verify that the data actually got loaded: + + {{{ + hbase(main):009:0> scan "xyz" + ROW COLUMN+CELL + 98 column=cf1:val, timestamp=1267737987733, value=val_98 + 1 row(s) in 0.0110 seconds + }}} + + And then query it back via Hive: + + {{{ + hive> select * from hbase_table_1; + Total MapReduce jobs = 1 + Launching Job 1 out of 1 + ... 
+ OK + 98 val_98 + Time taken: 4.582 seconds + }}} + - If instead you want to give Hive access to an existing HBase table, + If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE: {{{ @@ -93, +125 @@ The column mapping support currently available is somewhat cumbersome and restrictive: - * the first column in the Hive table automatically becomes the key in the HBase table + * the first column in the Hive table always maps to the key in the HBase table - * for each subsequent Hive column, the table creator must specify a corresponding entry in the comma-delimited {{{hbase.columns.mapping}}} string (so for a Hive table with n columns, the string should have n-1 entries) + * for each subsequent Hive column, the table creator must specify a corresponding entry in the comma-delimited {{{hbase.columns.mapping}}} string (so for a Hive table with n columns, the string should have n-1 entries); whitespace should '''not''' be used between entries, since it will be interpreted as part of the column name, which is almost certainly not what you want * a mapping entry is of the form {{{column-family-name:[column-name]}}} * if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns + * there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp + * since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column - TBD: details on how primitive and map values are serialized + The next few sections provide detailed examples of the kinds of column mappings currently possible.
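+ To make the whitespace restriction concrete, here's a hedged sketch (the table and column names are illustrative, not taken from the examples above) of a mapping string written correctly, with no spaces between the comma-delimited entries:
+ 
+ {{{
+ CREATE TABLE hbase_table_2(key int, value1 string, value2 string)
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+   "hbase.columns.mapping" = "cf1:val1,cf1:val2"
+ );
+ }}}
+ 
+ Writing the string as {{{"cf1:val1, cf1:val2"}}} instead would cause the handler to treat {{{ val2}}} (with a leading space) as the HBase column name within family {{{cf1}}}.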
+ + == Multiple Columns and Families == + + Here's an example with three Hive columns and two HBase column + families, with two of the Hive columns ({{{value1}}} and {{{value2}}}) + corresponding to one of the column families ({{{a}}}, with HBase + column names {{{b}}} and {{{c}}}), and the other Hive column + corresponding to a single column ({{{e}}}) in its own column family + ({{{d}}}). + + {{{ + CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "a:b,a:c,d:e" + ); + INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2 + FROM pokes WHERE foo=98 OR foo=100; + }}} + + Here's how this looks in HBase: + + {{{ + hbase(main):014:0> describe "hbase_table_1" + DESCRIPTION ENABLED + {NAME => 'hbase_table_1', FAMILIES => [{NAME => 'a', COMPRESSION => 'N true + ONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_M + EMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'd', COMPRESSION => + 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN + _MEMORY => 'false', BLOCKCACHE => 'true'}]} + 1 row(s) in 0.0170 seconds + hbase(main):015:0> scan "hbase_table_1" + ROW COLUMN+CELL + 100 column=a:b, timestamp=1267740457648, value=val_100 + 100 column=a:c, timestamp=1267740457648, value=101 + 100 column=d:e, timestamp=1267740457648, value=102 + 98 column=a:b, timestamp=1267740457648, value=val_98 + 98 column=a:c, timestamp=1267740457648, value=99 + 98 column=d:e, timestamp=1267740457648, value=100 + 2 row(s) in 0.0240 seconds + }}} + + And when queried back into Hive: + + {{{ + hive> select * from hbase_table_1; + Total MapReduce jobs = 1 + Launching Job 1 out of 1 + ... + OK + 100 val_100 101 102 + 98 val_98 99 100 + Time taken: 4.054 seconds + }}} + + == Hive MAP to HBase Column Family == + + Here's how a Hive MAP datatype can be used to access an entire column + family. 
Each row can have a different set of columns, where the + column names correspond to the map keys and the column values + correspond to the map values. + + {{{ + CREATE TABLE hbase_table_1(key int, value map<string,int>) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "cf:" + ); + INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, map(bar, foo) FROM pokes + WHERE foo=98 OR foo=100; + }}} + + Here's how this looks in HBase (with different column names in different rows): + + {{{ + hbase(main):012:0> scan "hbase_table_1" + ROW COLUMN+CELL + 100 column=cf:val_100, timestamp=1267739509194, value=100 + 98 column=cf:val_98, timestamp=1267739509194, value=98 + 2 row(s) in 0.0080 seconds + }}} + + And when queried back into Hive: + + {{{ + hive> select * from hbase_table_1; + Total MapReduce jobs = 1 + Launching Job 1 out of 1 + ... + OK + 100 {"val_100":100} + 98 {"val_98":98} + Time taken: 3.808 seconds + }}} + + Note that the key of the MAP must have datatype string, since it is + used for naming the HBase column, so the following table definition will fail: + + {{{ + CREATE TABLE hbase_table_1(key int, value map<int,int>) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "cf:" + ); + }}} + + == Illegal: Hive Primitive to HBase Column Family == + + Table definitions such as the following should be avoided because a + Hive column mapped to a column family must have MAP type: + + {{{ + CREATE TABLE hbase_table_1(key int, value string) + STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' + WITH SERDEPROPERTIES ( + "hbase.columns.mapping" = "cf:" + ); + }}} + + Currently, the CREATE TABLE will succeed, but attempts to insert data + will fail with this internal error: + + {{{ + java.lang.RuntimeException: org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyStringObjectInspector cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector + }}} + + An improvement would be to catch this at CREATE TABLE time and reject + it as invalid. = Potential Followups = + + There are a number of areas where Hive/HBase integration could definitely use more love: * more flexible column mapping (HIVE-806) * default column mapping in cases where no mapping spec is given * filter/projection pushdown - * implement virtual partitions corresponding to HBase timestamps + * expose timestamp attribute, possibly also with support for treating it as a partition key * allow per-table hbase.master configuration * run profiler and minimize any per-row overhead in column mapping * user-defined routines for lookups and data loads via HBase client API (HIVE-758 and HIVE-791) - * support a fast-path mode in which no map/reduce is used for simple queries (go through HBase client API instead?) + * support a fast-path mode in which no map/reduce is used for simple queries and loads (go through HBase client API instead) * logging is very noisy, with a lot of spurious exceptions; investigate these and either fix their cause or squelch them = Build = @@ -133, +297 @@ Positive QL tests are under {{{hbase-handler/src/test/queries}}}. These use an HBase+ZooKeeper mini-cluster for hosting the fixture tables in-process, so no free-standing HBase installation is needed in order to run them. - Run them like this: + The QL tests can be executed via ant like this: {{{ ant test -Dtestcase=TestHBaseCliDriver -Dqfile=hbase_queries.q
