[
https://issues.apache.org/jira/browse/HBASE-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492975#comment-14492975
]
Nick Dimiduk commented on HBASE-8073:
-------------------------------------
Reviewing comments and the patch. I think this is a good incremental step, but
doesn't achieve my intended use-case. What I had in mind is an ETL operation
that's generating some dataset (incremental or otherwise). It's run from some
infra that's decoupled from the online serving tier (i.e., perhaps generating the
dataset with EMR/S3 while the serving infra is on-prem). The ETL process knows it's
generating HFiles, but that's about it. We have the data set, we know what the
rowkeys will be -- there's no reason to make the user think about split points.
> HFileOutputFormat support for offline operation
> -----------------------------------------------
>
> Key: HBASE-8073
> URL: https://issues.apache.org/jira/browse/HBASE-8073
> Project: HBase
> Issue Type: Sub-task
> Components: mapreduce
> Reporter: Nick Dimiduk
> Fix For: 1.1.0
>
> Attachments: HBASE-8073-trunk-v0.patch, HBASE-8073-trunk-v1.patch
>
>
> When using HFileOutputFormat to generate HFiles, it inspects the region
> topology of the target table. The split points from that table are used to
> guide the TotalOrderPartitioner. If the target table does not exist, it is
> first created. This imposes an unnecessary dependence on an online HBase and
> existing table.
> If the table exists, it can be used. However, the job can be smarter. For
> example, if there's far more data going into the HFiles than the table
> currently contains, the table regions aren't very useful for data split
> points. Instead, the input data can be sampled to produce split points more
> meaningful to the dataset. LoadIncrementalHFiles is already capable of
> handling divergence between HFile boundaries and table regions, so this
> should not pose any additional burden at load time.
> The proper method of sampling the data likely requires a custom input format
> and an additional MapReduce job to perform the sampling. See a relevant
> implementation:
> https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
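The sampling idea above can be sketched in plain Java. This is a hypothetical illustration, not HBase or Hadoop API code: a single-pass reservoir sampler draws k row keys uniformly from a stream of unknown length, and the sorted sample yields candidate split points of the kind TotalOrderPartitioner consumes. The class and method names are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: reservoir sampling over a stream of row keys.
// After one pass, the reservoir holds a uniform random sample of size k,
// regardless of how many keys were seen.
public class RowKeyReservoirSampler {

    public static List<String> sample(Iterable<String> rowKeys, int k, long seed) {
        List<String> reservoir = new ArrayList<>(k);
        Random rnd = new Random(seed);
        long seen = 0;
        for (String key : rowKeys) {
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k keys.
                reservoir.add(key);
            } else {
                // Replace a random slot with probability k / seen,
                // which keeps every key equally likely to survive.
                long j = (long) (rnd.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, key);
                }
            }
        }
        return reservoir;
    }

    // Turn a sample of k keys into k-1 sorted split points (k partitions),
    // mirroring what a total-order partitioner expects as boundaries.
    public static List<String> splitPoints(List<String> sample) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        return sorted.subList(1, sorted.size());
    }
}
```

In a real job the sampling pass would run as its own MapReduce stage (as in the linked ReservoirSamplerInputFormat), with the resulting split points written to the partition file that TotalOrderPartitioner reads, so no live table or region topology is ever consulted.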
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)