[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

Vincent Poon (JIRA) Mon, 07 May 2018 13:48:24 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466452#comment-16466452
 ]


Vincent Poon commented on PHOENIX-4724:
---------------------------------------

[~jamestaylor] I wrote this myself. I forgot to add the Apache license and 
will do that in the next revision.

The current use case is the parent Jira, PHOENIX-4704, for pre-splitting an 
index table.  In that Jira I plan to scan or sample the data table, generate 
the index rowkey values, and feed them into this histogram.  Afterwards I can 
use the histogram bounds to create the index table with the proper splits.  
I'm thinking this will be done in the IndexTool, though we could possibly also 
put it somewhere in createTableInternal as an option.
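
As a rough illustration of that flow (feed values in, then read split points 
back out), here is a minimal Python sketch.  It is not the code in the attached 
patch: the class name, the integer keys (real rowkeys are byte arrays), the 
split_factor value, and the "merge the smallest adjacent pair" policy are all 
assumptions made to keep the example short.

```python
class StreamingEquiDepthHistogram:
    """Simplified streaming histogram: bars are [low, high, count], range [low, high)."""

    def __init__(self, max_bars=32, split_factor=1.7):
        if max_bars < 2:
            raise ValueError("max_bars must be at least 2")
        self.max_bars = max_bars
        self.split_factor = split_factor  # arbitrary tuning value for this sketch
        self.bars = []                    # sorted, contiguous bars
        self.total = 0

    def add(self, value):
        self.total += 1
        if not self.bars:
            self.bars.append([value, value + 1, 1])
            return
        i = self._find(value)
        self.bars[i][2] += 1
        # Split a bar once it holds noticeably more than its fair share.
        limit = max(2, self.split_factor * self.total / self.max_bars)
        low, high, count = self.bars[i]
        if count > limit and high - low > 1:
            self._split(i)

    def _find(self, value):
        # Stretch the outermost bars to cover values outside the current range.
        if value < self.bars[0][0]:
            self.bars[0][0] = value
            return 0
        if value >= self.bars[-1][1]:
            self.bars[-1][1] = value + 1
            return len(self.bars) - 1
        for i, (low, high, _) in enumerate(self.bars):
            if low <= value < high:
                return i

    def _split(self, i):
        low, high, count = self.bars[i]
        mid = (low + high) // 2
        # Assume values are spread evenly inside the bar, so halve the count.
        self.bars[i:i + 1] = [[low, mid, count // 2], [mid, high, count - count // 2]]
        if len(self.bars) > self.max_bars:
            self._merge_smallest_pair(skip=i)

    def _merge_smallest_pair(self, skip):
        # Merge the adjacent pair with the smallest combined count,
        # but never the pair that was just created by the split.
        candidates = [k for k in range(len(self.bars) - 1) if k != skip]
        j = min(candidates, key=lambda k: self.bars[k][2] + self.bars[k + 1][2])
        a, b = self.bars[j], self.bars[j + 1]
        self.bars[j:j + 2] = [[a[0], b[1], a[2] + b[2]]]

    def split_points(self, num_regions):
        """Return num_regions - 1 keys that cut the data into roughly equal parts."""
        points = []
        step = self.total / num_regions
        next_target = step
        seen = 0
        for low, high, count in self.bars:
            while seen + count >= next_target and len(points) < num_regions - 1:
                # Interpolate linearly inside the bar that crosses the target.
                frac = (next_target - seen) / count
                points.append(low + int(frac * (high - low)))
                next_target += step
            seen += count
        return points


if __name__ == "__main__":
    import random

    keys = list(range(10_000))
    random.shuffle(keys)
    histogram = StreamingEquiDepthHistogram(max_bars=32)
    for key in keys:
        histogram.add(key)
    print(histogram.split_points(4))  # approximate split keys for 4 regions
```

The returned split points would then be passed as the split keys when creating 
the index table.  Memory stays bounded by max_bars no matter how many values 
are fed in, at the cost of the boundaries being approximate.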

In the future we could also add a table option to create this histogram at 
compaction time, and maintain it in memory.  There's still work to be done:
 * I haven't investigated updates/deletes yet, which [~aertoria] also asked 
about.  Right now it only supports adding values, and it can't distinguish 
updates from inserts (I think to do that we would need a count-min sketch or a 
counting bloom filter implementation).
 * We need to add the ability to merge multiple histograms (e.g. from multiple 
different regions).
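
For reference on the first item, a count-min sketch is small enough to show 
inline.  The following Python is only a sketch of the general technique, not 
anything in the patch: the class name, the width/depth defaults, and the use of 
SHA-256 as the hash family are assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate per-key counters in fixed memory (depth rows of width cells)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key: bytes):
        # One hash per row, derived by prefixing the row number to the key.
        for row in range(self.depth):
            digest = hashlib.sha256(bytes([row]) + key).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, key: bytes, delta: int = 1):
        # Use delta=-1 to record a delete.
        for row, col in self._cells(key):
            self.rows[row][col] += delta

    def estimate(self, key: bytes) -> int:
        # Collisions only inflate cells, so the minimum is the tightest estimate.
        return min(self.rows[row][col] for row, col in self._cells(key))
```

The idea would be to check estimate(rowkey) before feeding a value into the 
histogram: a result greater than zero means the key has probably been seen 
already, so the write is likely an update rather than an insert.  As long as 
deletes are only recorded for keys that were actually added, the estimate never 
undercounts, so false positives are possible but a previously added key is 
never reported as new.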

> Efficient Equi-Depth histogram for streaming data
> -------------------------------------------------
>
>                 Key: PHOENIX-4724
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4724
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Major
>         Attachments: PHOENIX-4724.v1.patch
>
>
> Equi-Depth histogram from 
> http://web.cs.ucla.edu/~zaniolo/papers/Histogram-EDBT2011-CamReady.pdf, but 
> without the sliding window - we assume a single window over the entire data 
> set.
> Used to generate the bucket boundaries of a histogram where each bucket has 
> the same # of items.
> This is useful, for example, for pre-splitting an index table, by feeding in 
> data from the indexed column.
> Works on streaming data - the histogram is dynamically updated for each new 
> value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
