Nick Dimiduk created HBASE-8073:
-----------------------------------

             Summary: HFileOutputFormat support for offline operation
                 Key: HBASE-8073
                 URL: https://issues.apache.org/jira/browse/HBASE-8073
             Project: HBase
          Issue Type: New Feature
          Components: mapreduce
            Reporter: Nick Dimiduk


When HFileOutputFormat is used to generate HFiles, it inspects the region 
topology of the target table. The split points from that table are used to 
guide the TotalOrderPartitioner. If the target table does not exist, it is 
first created. This imposes an unnecessary dependence on an online HBase 
cluster and an existing table.

If the table exists, it can be used. However, the job can be smarter. For 
example, if there's far more data going into the HFiles than the table 
currently contains, the table regions aren't very useful for data split points. 
Instead, the input data can be sampled to produce split points more meaningful 
to the dataset. LoadIncrementalHFiles is already capable of handling divergence 
between HFile boundaries and table regions, so this should not pose any 
additional burden at load time.
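
To illustrate how split points could be derived from sampled input rather than 
from table regions, here is a minimal sketch in plain Java. SplitPointChooser 
and chooseSplitPoints are hypothetical names, not part of HBase or Hadoop. The 
sketch assumes the sample fits in memory and that row keys sort in unsigned 
lexicographic byte order. It returns at most numPartitions - 1 keys, which is 
the number of boundaries a total-order partitioner needs for numPartitions 
reducers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not an HBase class): picks split points from sampled row keys. */
final class SplitPointChooser {

    /** Unsigned lexicographic byte order, assumed here to match row key ordering. */
    static final Comparator<byte[]> ROW_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    };

    /**
     * Returns up to numPartitions - 1 strictly increasing split points taken
     * at evenly spaced positions of the sorted sample.
     */
    static List<byte[]> chooseSplitPoints(List<byte[]> sample, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        List<byte[]> sorted = new ArrayList<>(sample);
        sorted.sort(ROW_ORDER);

        List<byte[]> splits = new ArrayList<>();
        if (sorted.isEmpty()) {
            return splits; // nothing sampled, so no boundaries can be proposed
        }
        for (int i = 1; i < numPartitions; i++) {
            // i < numPartitions, so idx is always a valid index into sorted.
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            byte[] candidate = sorted.get(idx);
            // Skip repeated keys so the boundaries stay strictly increasing.
            if (splits.isEmpty()
                    || ROW_ORDER.compare(splits.get(splits.size() - 1), candidate) < 0) {
                splits.add(candidate);
            }
        }
        return splits;
    }
}
```

In a real job the resulting keys would still have to be written out in 
whatever partition-file format the partitioner expects; that step is omitted 
here.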

The proper method of sampling the data likely requires a custom input format 
and an additional map-reduce job to perform the sampling. See a relevant 
implementation: 
https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
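
For readers unfamiliar with the technique the linked class is named after, 
below is a minimal sketch of reservoir sampling (Algorithm R) in plain Java. 
It is not taken from the linked implementation and is not an input format; 
ReservoirSampler, offer, and sample are hypothetical names. It keeps a 
uniform random sample of fixed size from a stream whose length is not known 
in advance, which is the property that makes it attractive for sampling 
input records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of Algorithm R; not the linked ReservoirSamplerInputFormat. */
final class ReservoirSampler<T> {
    private final int capacity;
    private final Random random;
    private final List<T> reservoir;
    private long seen = 0;

    ReservoirSampler(int capacity, long seed) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
        this.random = new Random(seed);
        this.reservoir = new ArrayList<>(capacity);
    }

    /** Offers one item from the stream to the sampler. */
    void offer(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            // The first `capacity` items always fill the reservoir.
            reservoir.add(item);
            return;
        }
        // Pick a slot uniformly in [0, seen); the new item replaces an
        // existing one only when that slot falls inside the reservoir.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < capacity) {
            reservoir.set((int) slot, item);
        }
    }

    /** Returns a copy of the current sample. */
    List<T> sample() {
        return new ArrayList<>(reservoir);
    }
}
```

Each item ends up in the sample with probability capacity / seen, so the 
sample size stays fixed regardless of how much input is read.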

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
