[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSic hi

Apache Wiki Fri, 09 Apr 2010 12:46:11 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=3&rev2=4

--------------------------------------------------

  SET hive.hbase.bulk=true;
  
  INSERT OVERWRITE new_hbase_table
- SELECT rowid_expression, x, y FROM ...any_hive_query...;
+ SELECT rowkey_expression, x, y FROM ...any_hive_query...;
  }}}
  
  However, things aren't ''quite'' as straightforward as that yet.  Instead, a 
procedure involving a series of SQL commands is required.  It should still be a 
lot easier and more flexible than writing your own map/reduce program, and over 
time we hope to enhance Hive to move closer to the ideal.
@@ -39, +39 @@

  
  = Prepare Range Partitioning =
  
- In order to perform a parallel sort on the data, we need to range-partition 
it.  The idea is to divide the space of row keys up into nearly equal-sized 
ranges, one per reducer.  The details will vary according to your source data, 
and you may need to run a number of exploratory Hive queries in order to come 
up with a good enough set of ranges.  As an oversimplified example, suppose 
your row keys are transaction timestamps, you have a year's worth of data 
starting from January, your data growth is constant month-over-month, and you 
want to run 12 reducers.  In that case, you could use a query such as this one:
+ In order to perform a parallel sort on the data, we need to range-partition 
it.  The idea is to divide the space of row keys up into nearly equal-sized 
ranges, one per reducer.  The details will vary according to your source data, 
and you may need to run a number of exploratory Hive queries in order to come 
up with a good enough set of ranges.  As a highly contrived example, suppose 
your row keys are sequence-generated transaction ID strings (possibly with 
gaps), you have a year's worth of data starting from January, your data growth 
is constant month-over-month, and you want to run 12 reducers.  In that case, 
you could use a query such as this one:
  
  {{{
  select transaction_id
@@ -51, +51 @@

  limit 11
  }}}
  
- Note that we only want 11 values for breaking the data into 12 ranges, so we 
drop the max timestamp for the last month.  Also note that there are usually 
much cheaper ways to come up with good split keys; '''this is just an 
example'''.
+ Note that we only want 11 values for breaking the data into 12 ranges, so we 
drop the max timestamp for the last month.  Also note that the ORDER BY is 
necessary for producing the range start keys in ascending order.
+ 
+ ''Important:'' there are usually much cheaper ways to come up with good split 
keys; '''this is just an example to give you an idea of the kind of result your 
sampling query should produce'''.
+ 
+ In this example, our partitioning key is a single column.  In theory, you 
could partition over a compound key, but if you're going to do this, it might 
be safest to pre-concatenate your keys into a single string column (maybe using 
a view) which will serve as your HBase rowkey; this will avoid potential 
inconsistencies in comparators later on.
+ 
+ Once you have your sampling query defined, the next step is to save its 
results to a properly formatted file which will be used in a subsequent step.  
To do this, run commands like the following:
+ 
+ {{{
+ create external table hb_range_keys(transaction_id_range_start string)
+ row format serde 
+ 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
+ stored as 
+ inputformat 
+ 'org.apache.hadoop.mapred.TextInputFormat'
+ outputformat 
+ 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
+ location '/tmp/hb_range_keys';
+ 
+ insert overwrite table hb_range_keys
+ select transaction_id
+ from
+ (select month,max(transaction_id) as transaction_id
+  from transactions
+  group by month) m
+ order by transaction_id
+ limit 11;
+ }}}
+ 
+ The first command creates an external table defining the format of the file 
to be created; be sure to set the serde and inputformat/outputformat exactly as 
specified.  
+ 
+ The second command populates it (using the sampling query previously 
defined).  Usage of ORDER BY guarantees that a single file will be produced in 
directory {{{/tmp/hb_range_keys}}}.  The filename is unknown, but it is 
necessary to reference the file by name later, so run a command such as the 
following to copy it to a specific name:
+ 
+ {{{
+ dfs -cp /tmp/hb_range_keys/* /tmp/hb_range_key_list
+ }}}
  
  = Prepare Staging Location =
  
+ The sort is going to produce a lot of data, so make sure you have sufficient 
space in your HDFS cluster, and choose the location where the files will be 
staged.  We'll use {{{/tmp/hbsort}}} in this example.
+ 
  = Sort Data =
+ 
+ Now comes the big step:  running a sort over all of the data to be bulk 
loaded.  Make sure that your Hive instance has {{{hive_hbase-handler.jar}}} 
available on its auxpath.
+ 
+ {{{
+ set mapred.reduce.tasks=12;
+ set 
hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
+ set total.order.partitioner.natural.order=false;
+ set total.order.partitioner.path=/tmp/hb_range_key_list;
+ 
+ create table hbsort(transaction_id string, user_name string, amount double, 
...)
+ stored as
+ INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
+ OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
+ TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');
+ 
+ insert overwrite table hbsort
+ select transaction_id, user_name, amount, ...
+ from transactions
+ cluster by transaction_id;
+ }}}
+ 
+ The CREATE TABLE creates a dummy table which controls how the output of the 
sort is written.  Note that it uses {{{HiveHFileOutputFormat}}} to do this, 
with the table property {{{hfile.family.path}}} used to control the destination 
directory for the output.  Again, be sure to set the inputformat/outputformat 
exactly as specified.  
+ 
+ The {{{cf}}} in the path specifies the name of the column family which will 
be created in HBase, so the directory name you choose here is important.  (Note 
that we're not actually using an HBase table here; {{{HiveHFileOutputFormat}}} 
writes directly to files.)
+ 
+ The CLUSTER BY clause provides the keys to be used by the partitioner; be 
sure that it matches the range partitioning that you came up with in the 
earlier step.
+ 
+ The first column in the SELECT list is interpreted as the rowkey; subsequent 
columns become cell values (all in a single column family, so their column 
names are important.
  
  = Run HBase Script =
  
+ Once the sort job completes successfully, one final step is required for 
importing the result files into HBase.
+ 
+ If Hive and HBase are running in different clusters, use 
[[http://hadoop.apache.org/common/docs/current/distcp.html|distcp]] to copy the 
files from one to the other.
+ 
+ Once the files are in the HBase cluster, use the {{{bin/loadtable.rb}}} 
script which comes with HBase to import:
+ 
+ {{{
+ hbase org.jruby.Main loadtable.rb transactions /tmp/hbsort
+ }}}
+ 
+ The first argument ({{{transactions}}}) specifies the name of the new HBase 
table.  For the second argument, pass the staging directory name, not the the 
column family child directory.
+ 
+ After this script finishes, you may need to wait a minute or two for the new 
table to be picked up by the HBase meta scanner.  Use the hbase shell to verify 
that the new table was created correctly, and do some sanity queries to locate 
individual cells and make sure they can be found.
+

[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSic hi

Reply via email to