[
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466452#comment-16466452
]
Vincent Poon commented on PHOENIX-4724:
---------------------------------------
[~jamestaylor] I wrote this; I forgot to add the Apache license and will add
it in the next revision.
Current use case is the parent Jira PHOENIX-4704, for pre-splitting an index
table. In that Jira I plan to scan or sample the data table, generating the
index rowkey values and feeding them into this histogram. Afterwards I can use
the histogram bounds to create the index table with the proper splits. I'm
thinking this will be done in the IndexTool, though we could possibly put it
in createTableInternal somewhere as an option as well.
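To make the goal of the pre-splitting step concrete, here is a rough sketch of
what "bounds with the same number of items per bucket" means for a sample of
index rowkeys. This is not the streaming histogram from the patch: it sorts an
in-memory sample, which is the simplest way to show how the bounds would be
used. The class and method names (SampleSplitPoints, splitPoints) are made up
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the PHOENIX-4724 patch. */
final class SampleSplitPoints {

    /** Unsigned lexicographic comparison, the order HBase uses for row keys. */
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    /**
     * Returns at most numRegions - 1 split keys so that each region would
     * receive roughly the same number of the sampled row keys.
     */
    static List<byte[]> splitPoints(List<byte[]> sampledRowKeys, int numRegions) {
        if (numRegions < 1) {
            throw new IllegalArgumentException("numRegions must be >= 1");
        }
        List<byte[]> sorted = new ArrayList<>(sampledRowKeys);
        sorted.sort(SampleSplitPoints::compareKeys);

        List<byte[]> splits = new ArrayList<>();
        int n = sorted.size();
        for (int i = 1; i < numRegions; i++) {
            // Index of the first key that belongs to region i.
            int idx = (int) ((long) i * n / numRegions);
            if (idx == 0) {
                continue; // would create an empty first region
            }
            byte[] key = sorted.get(idx);
            // Skip duplicates so the split keys stay strictly increasing.
            if (!splits.isEmpty()
                    && compareKeys(splits.get(splits.size() - 1), key) == 0) {
                continue;
            }
            splits.add(key);
        }
        return splits;
    }
}
```

The returned keys play the role of the histogram bounds: they are what would
be supplied as the split keys when the index table is created. Sorting the
whole sample costs memory proportional to the sample size, which is the cost
the streaming histogram is meant to avoid.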
In the future we could also add a table option to create this histogram at
compaction time, and maintain it in memory. There's still work to be done:
* I haven't investigated updates/deletes yet, which [~aertoria] also inquired
about. Right now it only supports adding values, and can't distinguish updates
from inserts (I think to do that we would need a count-min sketch or counting
bloom filter implementation).
* Need to add functionality to merge multiple histograms (e.g. from multiple
different regions).
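For the merge item, one possible approach (only a sketch, nothing like this is
in the patch yet) is to concatenate the per-region buckets in key order and
then coalesce neighbours until each output bucket reaches its share of the
total count. The Bucket and HistogramMerge classes below are hypothetical, and
the sketch assumes the per-region histograms cover disjoint key ranges and are
passed in region order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of merging per-region histograms; not in the patch. */
final class HistogramMerge {

    /** A bucket covering row keys in [left, right) with an item count. */
    static final class Bucket {
        final byte[] left;
        final byte[] right;
        final long count;

        Bucket(byte[] left, byte[] right, long count) {
            this.left = left;
            this.right = right;
            this.count = count;
        }
    }

    /**
     * Coalesces the buckets of several histograms into at most numBuckets
     * buckets. Assumes perRegion is ordered by key range and that ranges do
     * not overlap, as with HBase regions.
     */
    static List<Bucket> merge(List<List<Bucket>> perRegion, int numBuckets) {
        if (numBuckets < 1) {
            throw new IllegalArgumentException("numBuckets must be >= 1");
        }
        List<Bucket> all = new ArrayList<>();
        long total = 0;
        for (List<Bucket> histogram : perRegion) {
            for (Bucket b : histogram) {
                all.add(b);
                total += b.count;
            }
        }
        List<Bucket> merged = new ArrayList<>();
        if (all.isEmpty()) {
            return merged;
        }
        byte[] left = all.get(0).left;
        long running = 0;   // items seen so far across all input buckets
        long inCurrent = 0; // items in the output bucket being built
        for (int i = 0; i < all.size(); i++) {
            Bucket b = all.get(i);
            running += b.count;
            inCurrent += b.count;
            boolean last = (i == all.size() - 1);
            // Cumulative count at which the next output bucket should end.
            long k = merged.size() + 1;
            long nextTarget = (total * k + numBuckets - 1) / numBuckets;
            boolean mayClose = merged.size() < numBuckets - 1;
            if (last || (mayClose && running >= nextTarget)) {
                merged.add(new Bucket(left, b.right, inCurrent));
                left = b.right;
                inCurrent = 0;
            }
        }
        return merged;
    }
}
```

Because this never splits an input bucket, the result is only approximately
equi-depth: an input bucket larger than total / numBuckets ends up as an
oversized output bucket. A real implementation would probably need to split
such buckets as well.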
> Efficient Equi-Depth histogram for streaming data
> -------------------------------------------------
>
> Key: PHOENIX-4724
> URL: https://issues.apache.org/jira/browse/PHOENIX-4724
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Vincent Poon
> Assignee: Vincent Poon
> Priority: Major
> Attachments: PHOENIX-4724.v1.patch
>
>
> Equi-Depth histogram from
> http://web.cs.ucla.edu/~zaniolo/papers/Histogram-EDBT2011-CamReady.pdf, but
> without the sliding window - we assume a single window over the entire data
> set.
> Used to generate the bucket boundaries of a histogram where each bucket has
> the same # of items.
> This is useful, for example, for pre-splitting an index table, by feeding in
> data from the indexed column.
> Works on streaming data - the histogram is dynamically updated for each new
> value.