[Hadoop Wiki] Update of "Hive/HBaseIntegration" by John Sichi

Apache Wiki Thu, 04 Mar 2010 14:48:51 -0800

The "Hive/HBaseIntegration" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseIntegration?action=diff&rev1=11&rev2=12

--------------------------------------------------

  
  Notice that even though a column name "val" is specified in the mapping, only 
the column family name "cf1" appears in the DESCRIBE output in the HBase shell. 
 This is because in HBase, only column families (not columns) are known in the 
table-level metadata; column names within a column family are only present at 
the per-row level.
  
- Here's how to move data from Hive into the HBase table:
+ Here's how to move data from Hive into the HBase table (see 
[[Hive/GettingStarted]] for how to create the example table {{{pokes}}} in Hive 
first):
  
  {{{
  INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;
@@ -262, +262 @@

  An improvement would be to catch this at CREATE TABLE time and reject
  it as invalid.
  
+ = Key Uniqueness =
+ 
+ One subtle difference between HBase tables and Hive tables is that HBase 
tables have a unique key, whereas Hive tables do not.  When multiple rows with 
the same key are inserted into HBase, only one of them is stored (the choice is 
arbitrary, so do not rely on HBase to pick the right one).  This is in contrast 
to Hive, which is happy to store multiple rows with the same key and different 
values.
+ 
+ For example, the pokes table contains rows with duplicate keys.  If it is 
copied into another Hive table, the duplicates are preserved:
+ 
+ {{{
+ CREATE TABLE pokes2(foo INT, bar STRING);
+ INSERT OVERWRITE TABLE pokes2 SELECT * FROM pokes;
+ -- this will return 3
+ SELECT COUNT(1) FROM pokes WHERE foo=498;
+ -- this will also return 3
+ SELECT COUNT(1) FROM pokes2 WHERE foo=498;
+ }}}
+ 
+ But in HBase, the duplicates are silently eliminated:
+ 
+ {{{
+ CREATE TABLE pokes3(foo INT, bar STRING)
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES (
+ "hbase.columns.mapping" = "cf:bar"
+ );
+ INSERT OVERWRITE TABLE pokes3 SELECT * FROM pokes;
+ -- this will return 1 instead of 3
+ SELECT COUNT(1) FROM pokes3 WHERE foo=498;
+ }}}
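+ 
+ If it matters which row survives, one option is to resolve the duplicates in Hive before the insert rather than leaving the choice to HBase.  The following is only a sketch: it assumes a Hive version that supports HAVING, and it assumes that keeping the largest {{{bar}}} value for each key is acceptable (any other aggregate could be substituted).
+ 
+ {{{
+ -- list the keys that occur more than once in the source table
+ SELECT foo, COUNT(1) FROM pokes GROUP BY foo HAVING COUNT(1) > 1;
+ -- write exactly one row per key, chosen explicitly
+ -- (assumption: the largest bar per key is the one wanted)
+ INSERT OVERWRITE TABLE pokes3 SELECT foo, MAX(bar) FROM pokes GROUP BY foo;
+ }}}
+ 
+ Because the GROUP BY yields a single row per key, the value stored in HBase no longer depends on the order in which duplicate rows happen to be written.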
+ 
  = Potential Followups =
  
  There are a number of areas where Hive/HBase integration could definitely use 
more love:
