[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi

Apache Wiki Fri, 09 Apr 2010 14:15:50 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=5&rev2=6

--------------------------------------------------

- '''under construction'''
- 
  This page explains how to use Hive to bulk load data into a new (empty) HBase 
table per [[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].  (If 
you're not using a build which contains this functionality yet, you'll need to 
build from source and make sure this patch is applied.)
  
  = Overview =
@@ -38, +36 @@

  
  Currently there are a number of constraints here:
  
- * The target table must be new (you can't bulk load into an existing table)
+  * The target table must be new (you can't bulk load into an existing table)
- * The target table can only have a single column family 
([[http://issues.apache.org/jira/browse/HBASE-1861|HBASE-1861]])
+  * The target table can only have a single column family 
([[http://issues.apache.org/jira/browse/HBASE-1861|HBASE-1861]])
- * The target table cannot be sparse (every row will have the same set of 
columns); this should be easy to fix by either allowing a MAP value to be read 
from Hive, and/or by allowing rows to be read from Hive in pivoted form (one 
row per HBase cell)
+  * The target table cannot be sparse (every row will have the same set of 
columns); this should be easy to fix by either allowing a MAP value to be read 
from Hive, and/or by allowing rows to be read from Hive in pivoted form (one 
row per HBase cell)
  
  Besides dealing with these constraints, the most important design decision 
here is how to assign an HBase row key to each row coming from Hive.  To avoid 
inconsistencies between lexical and binary comparators, it is simplest to 
design a string row key and use it consistently all the way through.  If you 
want to combine multiple columns into the key, use Hive's string concat 
expression.  You can use CREATE VIEW to add the row key logically without 
having to update any existing data in Hive.
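  
  As a minimal sketch of the view approach, a composite string row key could be 
built as shown here.  The source table name {{{transactions_src}}}, the view 
name {{{transactions_keyed}}}, and the ':' separator are assumptions made for 
illustration; they are not names used elsewhere on this page.
  
  {{{
  -- Hypothetical view: exposes a string rowkey without rewriting stored data
  CREATE VIEW transactions_keyed AS
  SELECT concat(user_name, ':', transaction_id) AS rowkey,
         user_name,
         amount
  FROM transactions_src;
  }}}
  
  Later steps could then read from the view instead of the base table.  Choose 
a separator that cannot occur in the key columns so that keys stay unambiguous.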
  
@@ -123, +121 @@

  cluster by transaction_id;
  }}}
  
- The CREATE TABLE creates a dummy table which controls how the output of the 
sort is written.  Note that it uses {{{HiveHFileOutputFormat}}} to do this, 
with the table property {{{hfile.family.path}}} used to control the destination 
directory for the output.  Again, be sure to set the inputformat/outputformat 
exactly as specified.  
+ The CREATE TABLE creates a dummy table which controls how the output of the 
sort is written.  Note that it uses {{{HiveHFileOutputFormat}}} to do this, 
with the table property {{{hfile.family.path}}} used to control the destination 
directory for the output.  Again, be sure to set the inputformat/outputformat 
exactly as specified.
  
  The {{{cf}}} in the path specifies the name of the column family which will 
be created in HBase, so the directory name you choose here is important.  (Note 
that we're not actually using an HBase table here; {{{HiveHFileOutputFormat}}} 
writes directly to files.)
  
@@ -137, +135 @@

  
  If Hive and HBase are running in different clusters, use 
[[http://hadoop.apache.org/common/docs/current/distcp.html|distcp]] to copy the 
files from one to the other.
  
- Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} 
script which comes with HBase to import:
+ Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} 
script which comes with HBase to import them:
  
  {{{
  hbase org.jruby.Main loadtable.rb transactions /tmp/hbsort
@@ -147, +145 @@

  
  After this script finishes, you may need to wait a minute or two for the new 
table to be picked up by the HBase meta scanner.  Use the hbase shell to verify 
that the new table was created correctly, and do some sanity queries to locate 
individual cells and make sure they can be found.
  
+ = Map New Table Back Into Hive =
+ 
+ Finally, if you'd like to access the HBase table you just created via Hive:
+ 
+ {{{
+ CREATE EXTERNAL TABLE hbase_transactions(transaction_id string, user_name 
string, amount double, ...) 
+ STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
+ WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:user_name,cf:amount,...")
+ TBLPROPERTIES("hbase.table.name" = "transactions");
+ }}}
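+ 
+ As a quick sanity check of the mapping, queries along these lines could be 
run from Hive.  This is a sketch only; the key value {{{txn-0001}}} is a 
made-up placeholder.
+ 
+ {{{
+ -- Peek at a few rows through the HBase-backed table
+ SELECT * FROM hbase_transactions LIMIT 10;
+ -- Look up a single row by its key (placeholder value)
+ SELECT user_name, amount FROM hbase_transactions
+ WHERE transaction_id = 'txn-0001';
+ }}}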
+ 
+ = Followups Needed =
+ 
+  * Support sparse tables
+  * Support loading binary data representations once HIVE-1245 is fixed
+  * Support assignment of timestamps
+  * Provide control over file parameters such as compression
+  * Support multiple column families once HBASE-1861 is implemented
+  * Wrap it all up into the ideal single-INSERT-with-auto-sampling job...
+
