[Hadoop Wiki] Update of "Hive/HBaseBulkLoad" by JohnSichi

Apache Wiki Fri, 16 Apr 2010 17:39:53 -0700


The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=13&rev2=14

--------------------------------------------------

  
  The CREATE TABLE creates a dummy table which controls how the output of the 
sort is written.  Note that it uses {{{HiveHFileOutputFormat}}} to do this, 
with the table property {{{hfile.family.path}}} used to control the destination 
directory for the output.  Again, be sure to set the inputformat/outputformat 
exactly as specified.  In the example above, we select gzip (gz) compression 
for the result files; if you don't set the {{{hfile.compression}}} parameter, 
no compression will be performed.  (The other method available is lzo, which 
compresses less aggressively but does not require as much CPU power.)
  
+ The parameter {{{hbase.hregion.max.filesize}}} (default 256MB) affects how 
HFiles are generated.  If the amount of data (pre-compression) produced by a 
reducer exceeds this limit, more than one HFile will be generated for that 
reducer, which leads to unbalanced region files.  This does not cause any 
correctness problems, but if you want balanced region files, either use more 
reducers or set this parameter to a larger value.  Note that when compression 
is enabled, you may see multiple files generated whose sizes are well below 
the limit; this is because the overflow check is done pre-compression.
+ 
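  As a rough sketch of the two remedies just described, the settings below 
could be issued in the Hive session before running the sort.  The values are 
illustrative assumptions, not recommendations, and the sketch assumes that a 
session-level {{{set}}} of {{{hbase.hregion.max.filesize}}} is passed through 
to the job configuration that {{{HiveHFileOutputFormat}}} reads.

```sql
-- Illustrative values only; pick one of the two options.

-- Option 1: spread the data over more reducers (24 is an assumed count).
-- The range partitioning from the earlier step would need to be redone
-- so that it still matches the number of reducers.
set mapred.reduce.tasks=24;

-- Option 2: raise the per-file limit to 1GB (1073741824 bytes, assumed value),
-- so that a reducer's pre-compression output fits in a single HFile.
set hbase.hregion.max.filesize=1073741824;
```

  Whichever option is used, the limit is still compared against the 
pre-compression size, so size it against the uncompressed output per reducer.
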
  The {{{cf}}} in the path specifies the name of the column family which will 
be created in HBase, so the directory name you choose here is important.  (Note 
that we're not actually using an HBase table here; {{{HiveHFileOutputFormat}}} 
writes directly to files.)
  
  The CLUSTER BY clause provides the keys to be used by the partitioner; be 
sure that it matches the range partitioning that you came up with in the 
earlier step.
