[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi

Apache Wiki Thu, 08 Apr 2010 17:46:51 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

--------------------------------------------------

New page:
'''under construction'''

This page explains how to use Hive to bulk load data into a new (empty) HBase 
table per [[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].  

= Overview =

Ideally, bulk load from Hive into HBase would be as simple as this:

{{{
CREATE TABLE new_hbase_table(rowkey string, x int, y int) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:x,cf:y");

SET hive.hbase.bulk=true;

INSERT OVERWRITE TABLE new_hbase_table
SELECT ... FROM hive_query;
}}}

However, things aren't ''quite'' as simple as that yet.  Instead, a multistep 
procedure is required involving both SQL and shell script commands.  It should 
still be a lot easier and more flexible than writing your own map/reduce 
program, and over time we can enhance Hive to move closer to the ideal.

The procedure is based on 
[[http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|underlying HBase recommendations]], 
and involves the following steps:

 1. Decide how many reducers to use for parallelizing the sorting and HFile 
creation.  This depends on the size of your data as well as the cluster 
resources available.
 1. Run Hive commands that create a file containing "splitter" keys, which are 
used to range-partition the data during the sort.
 1. Prepare a staging location in HDFS where the HFiles will be generated.
 1. Run Hive commands which will execute the sort and generate the HFiles.
 1. (Optional:  if HBase and Hive are running in different clusters, distcp the 
generated files from the Hive cluster to the HBase cluster.)
 1. Run HBase script {{{loadtable.rb}}} to move the files into a new HBase 
table.
 1. (Optional:  register the HBase table as an external table in Hive so you 
can access it from there.)

The rest of this page explains each step in greater detail.
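
To make the role of the "splitter" keys concrete, here is a minimal Python sketch of range partitioning. It is only an illustration of the idea, not the Hive or HBase implementation: the function names {{{choose_splitters}}} and {{{partition_for}}} are hypothetical, the evenly spaced sampling is an assumption, and so is the choice to send a key equal to a splitter to the higher partition. With N reducers, N-1 splitter keys divide the sorted key space into N contiguous ranges, so each reducer can write HFiles covering one range.

```python
from bisect import bisect_right


def choose_splitters(sorted_sample, num_reducers):
    """Pick num_reducers - 1 splitter keys at evenly spaced positions.

    sorted_sample is assumed to be a sorted list of row keys sampled
    from the data; evenly spaced selection is an illustrative choice.
    """
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    n = len(sorted_sample)
    return [sorted_sample[(i * n) // num_reducers]
            for i in range(1, num_reducers)]


def partition_for(rowkey, splitters):
    """Return the index of the reducer whose key range contains rowkey.

    Keys below the first splitter go to reducer 0; a key equal to a
    splitter goes to the higher of the two adjacent reducers (assumption).
    """
    return bisect_right(splitters, rowkey)


# Hypothetical sample of twelve row keys, split across three reducers.
sample = sorted("k%02d" % i for i in range(12))
splitters = choose_splitters(sample, 3)
assignments = {key: partition_for(key, splitters) for key in sample}
```

Because every key in one range sorts below every key in the next, the per-reducer outputs are globally ordered once each reducer sorts its own share. If the sample contains many duplicate keys, some splitters may coincide and leave a reducer with an empty range, so a real splitter file would be built from distinct keys.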
