[ https://issues.apache.org/jira/browse/HBASE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manukranth Kolloju updated HBASE-9815:
--------------------------------------
Attachment: Histogram-9815.diff
Attaching an implementation based on the paper cited in the issue description.
> Add Histogram representative of row key distribution inside a region.
> ---------------------------------------------------------------------
>
> Key: HBASE-9815
> URL: https://issues.apache.org/jira/browse/HBASE-9815
> Project: HBase
> Issue Type: New Feature
> Components: HFile
> Affects Versions: 0.89-fb
> Reporter: Manukranth Kolloju
> Assignee: Manukranth Kolloju
> Fix For: 0.89-fb
>
> Attachments: Histogram-9815.diff
>
>
> Using histogram information, users can split a scan workload into
> equal-sized scans based on the sizes estimated from the histogram.
> This helps systems that run queries on top of HBase to do cost-based
> optimization while scanning. The idea is to keep the histogram in the
> HFile trailer and populate it on compaction and flush.
> The HRegionInterface can expose an API that returns the histogram of
> a region, generated by merging the histograms of all its HFiles.
> The histogram implementation is based on
> http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
> http://dl.acm.org/citation.cfm?id=1951376
> and NumericHistogram from Hive.
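
For readers without the paper at hand, here is a minimal sketch of the update
and merge steps of the streaming histogram described in the Ben-Haim and
Tom-Tov paper linked above. It is not the attached patch. The class name
StreamingHistogram and its methods are hypothetical, and it assumes row keys
have already been mapped to doubles (for example from a fixed-length key
prefix), which the issue does not specify.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a bounded-size streaming histogram; not the attached patch. */
class StreamingHistogram {
    /** One bin: a centroid position and the number of points it represents. */
    static final class Bin {
        double center;
        long count;
        Bin(double center, long count) { this.center = center; this.count = count; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by center

    StreamingHistogram(int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be >= 1");
        this.maxBins = maxBins;
    }

    /** Update step: insert the point as its own bin, then shrink back to maxBins. */
    void add(double value) {
        insert(value, 1);
        shrink();
    }

    /** Merge step: absorb every bin of the other histogram, then shrink. */
    void merge(StreamingHistogram other) {
        // Copy first so that merging a histogram with itself is safe.
        for (Bin b : new ArrayList<>(other.bins)) insert(b.center, b.count);
        shrink();
    }

    long totalCount() {
        long n = 0;
        for (Bin b : bins) n += b.count;
        return n;
    }

    int binCount() { return bins.size(); }

    /** Sorted insert; a point that lands exactly on an existing centroid joins that bin. */
    private void insert(double center, long count) {
        int i = 0;
        while (i < bins.size() && bins.get(i).center < center) i++;
        if (i < bins.size() && bins.get(i).center == center) {
            bins.get(i).count += count;
        } else {
            bins.add(i, new Bin(center, count));
        }
    }

    /** Repeatedly replace the two closest bins with their count-weighted average. */
    private void shrink() {
        while (bins.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double gap = bins.get(i + 1).center - bins.get(i).center;
                if (gap < bestGap) { bestGap = gap; best = i; }
            }
            Bin a = bins.get(best);
            Bin b = bins.remove(best + 1);
            long n = a.count + b.count;
            a.center = (a.center * a.count + b.center * b.count) / n;
            a.count = n;
        }
    }
}
```

Under these assumptions, a per-region histogram could be built by creating one
such object per HFile and calling merge() across them. The memory used stays
bounded by maxBins no matter how many keys are added.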