[ 
https://issues.apache.org/jira/browse/HBASE-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955406#comment-13955406
 ] 

Nick Dimiduk commented on HBASE-8073:
-------------------------------------

That's an interesting idea. What about FS permissions? There's no assumption 
that a job-submitting user would have read access to the HBase user's data. 
Plus, it requires the data structure be available -- i.e., access to the same 
hdfs as is running HBase. The appeal of sampling is that all it requires of 
HBase is knowledge of the on-disk file format. Nothing about the destination is 
necessary. The complexity isn't that bad, just a second (short) MR job.
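
To make the sampling idea concrete, here is a rough sketch of the core of such a job: reservoir-sample the row keys in one pass, then take evenly spaced keys from the sorted sample as split points. This is only an illustration under assumptions, not HBase code. The class and method names (SplitPointSampler, sample, splitPoints) are made up, row keys are plain Strings instead of byte arrays, and everything runs in one process rather than as a MapReduce job. In a real job the resulting split points would presumably be written to the partition file that TotalOrderPartitioner reads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of HBase or Hadoop.
public class SplitPointSampler {

    /** Reservoir sampling (Algorithm R): uniform sample of up to k keys from a stream. */
    public static List<String> sample(Iterator<String> keys, int k, Random rng) {
        List<String> reservoir = new ArrayList<>();
        long seen = 0;
        while (keys.hasNext()) {
            String key = keys.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k keys.
                reservoir.add(key);
            } else {
                // Keep the new key with probability k / seen, evicting a random slot.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, key);
                }
            }
        }
        return reservoir;
    }

    /** Picks up to numPartitions - 1 strictly increasing boundaries from the sorted sample. */
    public static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            int idx = (int) ((long) i * sorted.size() / numPartitions);
            if (idx >= sorted.size()) {
                break; // empty sample: no boundaries to emit
            }
            String candidate = sorted.get(idx);
            // Skip duplicates so that boundaries stay strictly increasing.
            if (splits.isEmpty() || splits.get(splits.size() - 1).compareTo(candidate) < 0) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Stand-in for the job's input: 10,000 synthetic row keys.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            keys.add(String.format("row-%05d", i));
        }
        List<String> sampled = sample(keys.iterator(), 1000, new Random());
        System.out.println(splitPoints(sampled, 4));
    }
}
```

The point is that nothing here touches the cluster: the only inputs are the data being loaded and the desired number of partitions.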

> HFileOutputFormat support for offline operation
> -----------------------------------------------
>
>                 Key: HBASE-8073
>                 URL: https://issues.apache.org/jira/browse/HBASE-8073
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce
>            Reporter: Nick Dimiduk
>
> When using HFileOutputFormat to generate HFiles, it inspects the region 
> topology of the target table. The split points from that table are used to 
> guide the TotalOrderPartitioner. If the target table does not exist, it is 
> first created. This imposes an unnecessary dependence on an online HBase and 
> existing table.
> If the table exists, it can be used. However, the job can be smarter. For 
> example, if there's far more data going into the HFiles than the table 
> currently contains, the table regions aren't very useful for data split 
> points. Instead, the input data can be sampled to produce split points more 
> meaningful to the dataset. LoadIncrementalHFiles is already capable of 
> handling divergence between HFile boundaries and table regions, so this 
> should not pose any additional burden at load time.
> The proper method of sampling the data likely requires a custom input format 
> and an additional map-reduce job to perform the sampling. See a relevant 
> implementation: 
> https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java



