The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad

--------------------------------------------------

New page:
'''under construction'''

This page explains how to use Hive to bulk load data into a new (empty) HBase table per [[https://issues.apache.org/jira/browse/HIVE-1295|HIVE-1295]].

= Overview =

Ideally, bulk load from Hive into HBase would be as simple as this:

{{{
CREATE TABLE new_hbase_table(rowkey string, x int, y int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf:x,cf:y");

SET hive.hbase.bulk=true;

INSERT OVERWRITE TABLE new_hbase_table
SELECT ... FROM hive_query;
}}}

However, things aren't ''quite'' as simple as that yet. Instead, a multistep procedure is required, involving both SQL and shell script commands. It should still be a lot easier and more flexible than writing your own map/reduce program, and over time we can enhance Hive to move closer to the ideal.

The procedure is based on [[http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|underlying HBase recommendations]], and involves the following steps:

 1. Decide on the number of reducers you're planning to use for parallelizing the sorting and HFile creation. This depends on the size of your data as well as the cluster resources available.
 1. Run Hive commands to create a file containing the "splitter" keys that will be used for range-partitioning the data during the sort.
 1. Prepare a staging location in HDFS where the HFiles will be generated.
 1. Run Hive commands to execute the sort and generate the HFiles.
 1. (Optional: if HBase and Hive are running in different clusters, distcp the generated files from the Hive cluster to the HBase cluster.)
 1. Run the HBase script {{{loadtable.rb}}} to move the files into a new HBase table.
 1. (Optional: register the HBase table as an external table in Hive so you can access it from there.)

The rest of this page explains each step in greater detail.
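
To make the relationship between the reducer count and the "splitter" keys concrete, here is a minimal Python sketch of range partitioning. It is only an illustration of the idea, not the code Hive or HBase actually runs. The function names ({{{choose_splitters}}}, {{{reducer_for}}}, {{{partition_and_sort}}}) are hypothetical, and the boundary convention (a row key equal to a splitter goes to the higher range) is an assumption of this sketch. The point it shows: N reducers need at most N-1 splitter keys, and if each reducer sorts only its own range, concatenating the outputs in reducer order yields a globally sorted result.

```python
from bisect import bisect_right


def choose_splitters(sorted_sample_keys, num_reducers):
    """Pick up to num_reducers - 1 boundary keys from a sorted sample.

    Hypothetical helper: takes keys at evenly spaced positions in the sample.
    """
    if num_reducers < 1:
        raise ValueError("num_reducers must be at least 1")
    n = len(sorted_sample_keys)
    if n == 0:
        return []
    splitters = []
    for i in range(1, num_reducers):
        key = sorted_sample_keys[i * n // num_reducers]
        # Skip duplicates so the boundaries stay strictly increasing.
        if not splitters or key > splitters[-1]:
            splitters.append(key)
    return splitters


def reducer_for(rowkey, splitters):
    """Return the index of the key range (reducer) that rowkey falls into.

    Assumption of this sketch: a key equal to a splitter goes to the
    higher range.
    """
    return bisect_right(splitters, rowkey)


def partition_and_sort(rowkeys, splitters):
    """Route each row key to its range, then sort each range independently."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for key in rowkeys:
        buckets[reducer_for(key, splitters)].append(key)
    return [sorted(bucket) for bucket in buckets]
```

For example, with a sorted sample of 100 keys and 4 reducers, {{{choose_splitters}}} returns 3 boundary keys, and every row key is routed to one of 4 ranges. Because the ranges do not overlap, the per-reducer outputs never need to be merged with each other.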
