Nick Dimiduk created HBASE-8073:
-----------------------------------
Summary: HFileOutputFormat support for offline operation
Key: HBASE-8073
URL: https://issues.apache.org/jira/browse/HBASE-8073
Project: HBase
Issue Type: New Feature
Components: mapreduce
Reporter: Nick Dimiduk
When using HFileOutputFormat to generate HFiles, it inspects the region
topology of the target table. The split points from that table are used to
guide the TotalOrderPartitioner. If the target table does not exist, it is
first created. This imposes an unnecessary dependence on an online HBase
cluster and an existing table.
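To make the role of the split points concrete, here is a minimal sketch of how
a sorted list of split points assigns a row key to a partition. It is an
illustration only: the class SplitPointLookup and its methods are hypothetical
names, not HBase or Hadoop APIs, and the lookup is assumed to behave like a
total-order partitioner where a key equal to a split point belongs to the
partition that starts at that split point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of HBase or Hadoop.
class SplitPointLookup {

    // Unsigned lexicographic comparison of two byte arrays,
    // the ordering assumed here for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Returns a partition index in [0, splits.length].
    // splits must be sorted ascending. Keys below splits[0] map to
    // partition 0; a key k with splits[i] <= k < splits[i + 1] maps to i + 1.
    static int partitionFor(byte[] key, byte[][] splits) {
        int lo = 0;
        int hi = splits.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(key, splits[mid]) >= 0) {
                lo = mid + 1;   // key is at or past this split point
            } else {
                hi = mid;       // key is before this split point
            }
        }
        return lo;              // number of split points <= key
    }

    public static void main(String[] args) {
        byte[][] splits = {
            "g".getBytes(StandardCharsets.UTF_8),
            "n".getBytes(StandardCharsets.UTF_8),
            "t".getBytes(StandardCharsets.UTF_8)
        };
        byte[] key = "m".getBytes(StandardCharsets.UTF_8);
        System.out.println(partitionFor(key, splits)); // prints 1
    }
}
```

With N split points there are N + 1 partitions, so where the split points come
from (table regions or a sample of the input) determines how evenly the output
HFiles are sized.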
If the table exists, it can be used. However, the job can be smarter. For
example, if there's far more data going into the HFiles than the table
currently contains, the table regions aren't very useful for data split points.
Instead, the input data can be sampled to produce split points more meaningful
to the dataset. LoadIncrementalHFiles is already capable of handling divergence
between HFile boundaries and table regions, so this should not pose any
additional burden at load time.
The proper method of sampling the data likely requires a custom input format
and an additional map-reduce job to perform the sampling. See a relevant
implementation:
https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
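As a rough sketch of the sampling idea (not the linked implementation and not
a proposed patch), the following shows single-pass reservoir sampling
(Algorithm R) over a stream of row keys, followed by picking evenly spaced
split points from the sorted sample. The class SampledSplitPoints and its
method names are hypothetical. Keys are modeled as Strings for brevity; real
row keys are byte arrays compared as unsigned bytes, which gives the same
order only for plain ASCII keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; not part of HBase or Hadoop.
class SampledSplitPoints {

    // Algorithm R: keeps a uniform random sample of at most k keys
    // from a stream of unknown length, using O(k) memory.
    static List<String> reservoirSample(Iterator<String> keys, int k, Random rng) {
        List<String> reservoir = new ArrayList<>();
        long seen = 0;
        while (keys.hasNext()) {
            String key = keys.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(key);               // fill the reservoir first
            } else {
                // j is uniform in [0, seen); replace a slot with probability k / seen
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, key);
                }
            }
        }
        return reservoir;
    }

    // Picks up to numPartitions - 1 split points at evenly spaced
    // positions in the sorted sample, skipping consecutive duplicates.
    static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits;
        }
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            String candidate = sorted.get(idx);
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }
}
```

In a real job the sample would be gathered across input splits by the sampling
map-reduce job, and the resulting split points would be written to the
partition file read by TotalOrderPartitioner; both of those steps are outside
this sketch.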