[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487916#comment-16487916 ]

Hudson commented on PHOENIX-4724:
---------------------------------

FAILURE: Integrated in Jenkins build PreCommit-PHOENIX-Build #1885 (See [https://builds.apache.org/job/PreCommit-PHOENIX-Build/1885/])
PHOENIX-4724 Efficient Equi-Depth histogram for streaming data (vincentpoon: rev cb17adbbde56cacd43846ead2200e6606ed64ae8)
* (add) phoenix-core/src/test/java/org/apache/phoenix/util/EquiDepthStreamHistogramTest.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/util/EquiDepthStreamHistogram.java

> Efficient Equi-Depth histogram for streaming data
> -------------------------------------------------
>
>                 Key: PHOENIX-4724
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4724
>             Project: Phoenix
>          Issue Type: Sub-task
>    Affects Versions: 4.14.0
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Major
>             Fix For: 4.14.0, 5.0.0
>
>         Attachments: PHOENIX-4724.v1.patch, PHOENIX-4724.v2.patch
>
>
> Equi-Depth histogram from
> http://web.cs.ucla.edu/~zaniolo/papers/Histogram-EDBT2011-CamReady.pdf, but
> without the sliding window - we assume a single window over the entire data
> set.
> Used to generate the bucket boundaries of a histogram where each bucket has
> the same # of items.
> This is useful, for example, for pre-splitting an index table, by feeding in
> data from the indexed column.
> Works on streaming data - the histogram is dynamically updated for each new
> value.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
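The issue description asks for bucket boundaries that keep roughly the same number of items per bucket while values arrive one at a time. The following is a minimal sketch of one way that could work, assuming `long` values and a simple split-and-merge policy: an over-full bucket is split at the midpoint of its value range, and the adjacent pair with the smallest combined count is merged to stay within the bucket limit. The class `StreamHistogramSketch`, its methods, and the "twice the ideal size" threshold are all hypothetical; this is not the `EquiDepthStreamHistogram` class from the patch, nor a transcription of the algorithm in the linked paper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the EquiDepthStreamHistogram class from the patch. */
final class StreamHistogramSketch {
    /** Bucket over the inclusive value range [lo, hi] with an approximate item count. */
    static final class Bucket {
        long lo, hi, count;
        Bucket(long lo, long hi, long count) { this.lo = lo; this.hi = hi; this.count = count; }
    }

    private final int maxBuckets;
    private final List<Bucket> buckets = new ArrayList<>();
    private long total;

    StreamHistogramSketch(int maxBuckets) {
        if (maxBuckets < 2) throw new IllegalArgumentException("maxBuckets must be >= 2");
        this.maxBuckets = maxBuckets;
    }

    List<Bucket> buckets() { return buckets; }

    /** Updates the histogram with one value from the stream. */
    void addValue(long v) {
        total++;
        if (buckets.isEmpty()) {
            buckets.add(new Bucket(v, v, 1));
            return;
        }
        Bucket first = buckets.get(0);
        Bucket last = buckets.get(buckets.size() - 1);
        if (v < first.lo) first.lo = v; // widen the outer buckets to cover new extremes
        if (v > last.hi) last.hi = v;
        int i = 0;
        while (v > buckets.get(i).hi) i++; // linear scan keeps the sketch short
        Bucket b = buckets.get(i);
        b.count++;
        // Assumed policy: a bucket is over-full at twice the ideal equi-depth size.
        long limit = Math.max(2, 2 * total / maxBuckets);
        if (b.count > limit && b.hi > b.lo) {
            split(i);
            if (buckets.size() > maxBuckets) mergeSmallestPair();
        }
    }

    /** Splits bucket i at the midpoint of its range, assuming values are spread evenly inside it. */
    private void split(int i) {
        Bucket b = buckets.get(i);
        long mid = b.lo + (b.hi - b.lo) / 2; // assumes the range width fits in a long
        long leftCount = b.count / 2;
        buckets.add(i + 1, new Bucket(mid + 1, b.hi, b.count - leftCount));
        b.hi = mid;
        b.count = leftCount;
    }

    /** Merges the adjacent pair of buckets with the smallest combined count. */
    private void mergeSmallestPair() {
        int best = 0;
        for (int i = 1; i + 1 < buckets.size(); i++) {
            if (pairCount(i) < pairCount(best)) best = i;
        }
        Bucket left = buckets.get(best);
        Bucket right = buckets.remove(best + 1);
        left.hi = right.hi;
        left.count += right.count;
    }

    private long pairCount(int i) {
        return buckets.get(i).count + buckets.get(i + 1).count;
    }

    public static void main(String[] args) {
        StreamHistogramSketch h = new StreamHistogramSketch(4);
        for (long v = 0; v < 100; v++) h.addValue((v * 37) % 100); // 0..99 in scrambled order
        for (Bucket b : h.buckets()) System.out.println("[" + b.lo + ", " + b.hi + "] ~" + b.count);
    }
}
```

The counts are approximate because a split divides a bucket's count in half without knowing where its values actually lie. The sketch only guarantees that the counts sum to the number of values added, that the buckets stay contiguous, and that there are never more than `maxBuckets` of them.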
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481431#comment-16481431 ]

Vincent Poon commented on PHOENIX-4724:
---------------------------------------

Pushed to 4.x-cdh5.11 also
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481371#comment-16481371 ]

James Taylor commented on PHOENIX-4724:
---------------------------------------

Please don't forget the 4.x-cdh5.11 branch, [~vincentpoon]. Note that's the only cdh branch you should commit to.
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481155#comment-16481155 ]

Hudson commented on PHOENIX-4724:
---------------------------------

FAILURE: Integrated in Jenkins build Phoenix-4.x-HBase-1.3 #137 (See [https://builds.apache.org/job/Phoenix-4.x-HBase-1.3/137/])
PHOENIX-4724 Efficient Equi-Depth histogram for streaming data (vincentpoon: rev 5935edd71873f9ec766ffe35000e96d2e48d)
* (add) phoenix-core/src/test/java/org/apache/phoenix/util/EquiDepthStreamHistogramTest.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/util/EquiDepthStreamHistogram.java
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481096#comment-16481096 ]

Hudson commented on PHOENIX-4724:
---------------------------------

SUCCESS: Integrated in Jenkins build Phoenix-4.x-HBase-0.98 #1897 (See [https://builds.apache.org/job/Phoenix-4.x-HBase-0.98/1897/])
PHOENIX-4724 Efficient Equi-Depth histogram for streaming data (vincentpoon: rev 865eb9a5362a0273cb85f6370b4470f03102a05a)
* (add) phoenix-core/src/test/java/org/apache/phoenix/util/EquiDepthStreamHistogramTest.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/util/EquiDepthStreamHistogram.java
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480191#comment-16480191 ]

James Taylor commented on PHOENIX-4724:
---------------------------------------

+1. Excellent work, [~vincentpoon]!
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469198#comment-16469198 ]

Maryann Xue commented on PHOENIX-4724:
--------------------------------------

Yes, I agree with [~jamestaylor] that this information can be useful for the query optimizer. Right now, for WHERE clause conditions other than filters on the primary key, we can only make a very rough "guess" at the number of rows/bytes in the filtered output. This information can definitely give a more accurate estimate for filter conditions on columns covered by the histogram. For example, for a range or equality condition on such a column, we can estimate the filtered rows/bytes as a fraction of the total: (number of buckets that fall in the range / total number of buckets).
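The estimate described above, buckets in range divided by total buckets, can be sketched as follows. This is an illustration only: `SelectivitySketch` and `estimateSelectivity` are hypothetical names, the boundaries are assumed to be `long` values, and partially covered buckets are prorated on the assumption that values are spread evenly inside a bucket, which the comment does not specify.

```java
/** Hypothetical helper; boundaries are assumed to come from an equi-depth histogram. */
final class SelectivitySketch {
    /**
     * bounds holds k + 1 sorted entries for k buckets; bucket i covers [bounds[i], bounds[i + 1]).
     * Returns the estimated fraction of rows with lo <= value < hi.
     */
    static double estimateSelectivity(long[] bounds, long lo, long hi) {
        int k = bounds.length - 1;
        if (k < 1 || hi <= lo) return 0.0;
        double coveredBuckets = 0.0;
        for (int i = 0; i < k; i++) {
            long start = bounds[i];
            long end = bounds[i + 1];
            if (end == start) { // zero-width bucket: a single heavily repeated value
                if (lo <= start && start < hi) coveredBuckets += 1.0;
                continue;
            }
            // Prorate a partially covered bucket (assumption: uniform spread inside it).
            long overlap = Math.min(hi, end) - Math.max(lo, start);
            if (overlap > 0) coveredBuckets += (double) overlap / (end - start);
        }
        // Every bucket holds about the same number of rows, so covered / total is the row fraction.
        return coveredBuckets / k;
    }

    public static void main(String[] args) {
        long[] bounds = {0, 10, 20, 40, 100}; // four buckets of equal depth
        System.out.println(estimateSelectivity(bounds, 5, 30));
    }
}
```

Multiplying the returned fraction by the table's total row or byte count would give the estimated size of the filtered output.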
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466452#comment-16466452 ]

Vincent Poon commented on PHOENIX-4724:
---------------------------------------

[~jamestaylor] I wrote this. I forgot to add the Apache license; I will do that in the next revision.

The current use case is the parent Jira, PHOENIX-4704, for pre-splitting an index table. In that Jira I plan to scan or sample the data table, generating the index rowkey values and feeding them into this histogram. Afterwards I can use the histogram bounds to create the index table with the proper splits. I'm thinking this will be done in the IndexTool, though we could possibly put it somewhere in createTableInternal as an option as well.

In the future we could also add a table option to create this histogram at compaction time and maintain it in memory.

There's still work to be done:
* I haven't investigated updates/deletes yet, which [~aertoria] also inquired about. Right now it only supports adding values, and can't distinguish updates from inserts (I think to do that we would need a count-min sketch or counting bloom filter implementation).
* We need to add functionality to merge multiple histograms (e.g. from multiple different regions).
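The step of turning histogram bounds into table splits can be sketched as follows. This is a sketch under assumptions: `SplitPointSketch` and `splitPoints` are hypothetical names, and the input is taken to be the buckets' upper bounds as row keys already sorted in row-key order. Nothing here is taken from the patch or from IndexTool.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that derives split keys from sorted bucket upper bounds. */
final class SplitPointSketch {
    static byte[][] splitPoints(List<byte[]> sortedUpperBounds) {
        List<byte[]> splits = new ArrayList<>();
        // The last bucket's upper bound marks the end of the key space, not a split,
        // so k buckets yield at most k - 1 split keys (k regions).
        for (int i = 0; i + 1 < sortedUpperBounds.size(); i++) {
            byte[] bound = sortedUpperBounds.get(i);
            if (bound.length == 0) continue; // an empty key cannot separate two regions
            boolean duplicate = !splits.isEmpty()
                    && Arrays.equals(splits.get(splits.size() - 1), bound);
            if (!duplicate) splits.add(bound.clone()); // repeated bounds would give empty regions
        }
        return splits.toArray(new byte[0][]);
    }

    public static void main(String[] args) {
        List<byte[]> bounds = new ArrayList<>();
        for (String s : new String[] {"d", "m", "t", "z"}) {
            bounds.add(s.getBytes(StandardCharsets.UTF_8));
        }
        for (byte[] split : splitPoints(bounds)) {
            System.out.println(new String(split, StandardCharsets.UTF_8));
        }
    }
}
```

Because the buckets have equal depth, regions created from these keys should each receive a similar share of the index rows, provided the scanned or sampled rows are representative of the table.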
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466376#comment-16466376 ]

James Taylor commented on PHOENIX-4724:
---------------------------------------

Great stuff, [~vincentpoon]. A couple of questions:
* Does the histogram building take into account a value being overwritten again and again (since at write time we don't know whether we're overwriting or not, unless the table is declared immutable)? Same question for the delete of a column value.
* How would you envision this being integrated with Phoenix? Maybe we could generate it when we update stats: at table compaction time and when UPDATE STATISTICS is called? Or, depending on the answers to the above, would we try to constantly update the histogram as the data is mutating?
* This information would definitely be useful for the optimizer. We have PHOENIX-1178, but not much detail there. [~maryannxue] could likely fill in how this information could be used. One big area is estimating how much data will be filtered by a WHERE clause. Another is how distinct a column is.
* Did you write EquiDepthStreamHistogram.java or did you find it somewhere online? If the latter, how was it licensed?
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466265#comment-16466265 ]

Ethan Wang commented on PHOENIX-4724:
-------------------------------------

[~xucang] Correct. But the distribution info is used at the moment the splitting happens, if I'm not mistaken.
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466121#comment-16466121 ]

Xu Cang commented on PHOENIX-4724:
----------------------------------

[~aertoria] I am not speaking for Vincent, but my understanding is that this method will be used when a user wants to build an index. This is a one-time effort based on the current state of the table (you could call it a snapshot). So no use case requires removeValue() in this index-building scenario.
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464280#comment-16464280 ]

Ethan Wang commented on PHOENIX-4724:
-------------------------------------

[~vincentpoon] I see. When the data table is mutated, what is the strategy today for the index table to sync up? With that in mind, does the EquiDepthStreamHistogram class support removeValue() in addition to addValue()?
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463382#comment-16463382 ]

Ethan Wang commented on PHOENIX-4724:
-------------------------------------

[~vincentpoon] If I understand correctly, with this feature implemented, when you build an index table you will at the same time record some info into this histogram, so that at some point in the future you can conveniently get the distribution info of the index table. Correct? So do you store a histogram object for each index table, like a shadow object somewhere offline? Also, will there ever be a case where you need to mutate or remove an index entry in an existing index table? Cool idea!
[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data
[ https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463240#comment-16463240 ]

Vincent Poon commented on PHOENIX-4724:
---------------------------------------

[~aertoria] check it out