Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "Hive/HBaseBulkLoad" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad?action=diff&rev1=13&rev2=14

--------------------------------------------------

  The CREATE TABLE creates a dummy table which controls how the output of the sort is written. Note that it uses {{{HiveHFileOutputFormat}}} to do this, with the table property {{{hfile.family.path}}} used to control the destination directory for the output. Again, be sure to set the inputformat/outputformat exactly as specified. In the example above, we select gzip (gz) compression for the result files; if you don't set the {{{hfile.compression}}} parameter, no compression will be performed. (The other method available is lzo, which compresses less aggressively but does not require as much CPU power.)

+ There is a parameter {{{hbase.hregion.max.filesize}}} (default 256MB) which affects how HFiles are generated. If the amount of data (pre-compression) produced by a reducer exceeds this limit, more than one HFile will be generated for that reducer. This will lead to unbalanced region files. This will not cause any correctness problems, but if you want to get balanced region files, either use more reducers or set this parameter to a larger value. Note that when compression is enabled, you may see multiple files generated whose sizes are well below the limit; this is because the overflow check is done pre-compression.
+
  The {{{cf}}} in the path specifies the name of the column family which will be created in HBase, so the directory name you choose here is important. (Note that we're not actually using an HBase table here; {{{HiveHFileOutputFormat}}} writes directly to files.) The CLUSTER BY clause provides the keys to be used by the partitioner; be sure that it matches the range partitioning that you came up with in the earlier step.
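
To make the {{{hbase.hregion.max.filesize}}} note concrete, here is a rough Python sketch of the arithmetic. It is not part of Hive or HBase: the function names are hypothetical, and it assumes a simplified model in which a reducer starts a new HFile each time its uncompressed output reaches the limit, with data spread evenly across reducers. The real rollover point may differ slightly.

```python
# Simplified model of the pre-compression overflow check described above.
# All names here are illustrative assumptions, not Hive or HBase APIs.

DEFAULT_MAX_FILESIZE = 256 * 1024 * 1024  # 256MB default stated in the text


def ceil_div(a, b):
    """Integer ceiling division (avoids floating-point rounding)."""
    return -(-a // b)


def estimate_hfiles(reducer_bytes, max_filesize=DEFAULT_MAX_FILESIZE):
    """Approximate number of HFiles written by one reducer.

    reducer_bytes is the uncompressed output size of that reducer.
    """
    if reducer_bytes <= 0:
        return 0
    return ceil_div(reducer_bytes, max_filesize)


def min_reducers_for_single_hfile(total_bytes, max_filesize=DEFAULT_MAX_FILESIZE):
    """Fewest reducers such that evenly distributed data stays within
    the limit per reducer, so each reducer emits one HFile."""
    return max(1, ceil_div(total_bytes, max_filesize))


if __name__ == "__main__":
    gib = 1024 ** 3
    # One reducer producing 1 GiB uncompressed against the 256MB default
    print(estimate_hfiles(gib))
    # Reducers needed for 10 GiB total so that none exceeds the default
    print(min_reducers_for_single_hfile(10 * gib))
```

Under this model the sizes are measured before compression, which is why the files on disk can end up well below the limit when gz or lzo is enabled.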
