[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487916#comment-16487916
 ] 

Hudson commented on PHOENIX-4724:
-

FAILURE: Integrated in Jenkins build PreCommit-PHOENIX-Build #1885 (See 
[https://builds.apache.org/job/PreCommit-PHOENIX-Build/1885/])
PHOENIX-4724 Efficient Equi-Depth histogram for streaming data (vincentpoon: 
rev cb17adbbde56cacd43846ead2200e6606ed64ae8)
* (add) 
phoenix-core/src/test/java/org/apache/phoenix/util/EquiDepthStreamHistogramTest.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/util/EquiDepthStreamHistogram.java


> Efficient Equi-Depth histogram for streaming data
> -
>
> Key: PHOENIX-4724
> URL: https://issues.apache.org/jira/browse/PHOENIX-4724
> Project: Phoenix
> Issue Type: Sub-task
> Affects Versions: 4.14.0
> Reporter: Vincent Poon
> Assignee: Vincent Poon
> Priority: Major
> Fix For: 4.14.0, 5.0.0
>
> Attachments: PHOENIX-4724.v1.patch, PHOENIX-4724.v2.patch
>
>
> Equi-Depth histogram from 
> http://web.cs.ucla.edu/~zaniolo/papers/Histogram-EDBT2011-CamReady.pdf, but 
> without the sliding window - we assume a single window over the entire data 
> set.
> Used to generate the bucket boundaries of a histogram where each bucket has 
> the same # of items.
> This is useful, for example, for pre-splitting an index table, by feeding in 
> data from the indexed column.
> Works on streaming data - the histogram is dynamically updated for each new 
> value.
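
The split-and-merge idea in the description above can be sketched roughly as follows. This is not the code from the attached patches: the class name, the long-valued keys, the maxBuckets parameter, and the 2x split threshold are all assumptions made for illustration. Counts after a split are approximate, because the sketch assumes values are spread evenly within a bucket.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the EquiDepthStreamHistogram class from the patch. */
final class StreamingEquiDepthSketch {
    /** One bucket covering the inclusive range [lo, hi] with an approximate item count. */
    static final class Bucket {
        long lo, hi, count;
        Bucket(long lo, long hi, long count) { this.lo = lo; this.hi = hi; this.count = count; }
    }

    private final int maxBuckets;                       // hypothetical tuning parameter
    private final List<Bucket> buckets = new ArrayList<>();
    private long total;

    StreamingEquiDepthSketch(int maxBuckets) {
        if (maxBuckets < 1) throw new IllegalArgumentException("maxBuckets must be >= 1");
        this.maxBuckets = maxBuckets;
    }

    /** Updates the histogram with one new value from the stream. */
    void addValue(long v) {
        total++;
        if (buckets.isEmpty()) { buckets.add(new Bucket(v, v, 1)); return; }
        int i = locate(v);
        Bucket b = buckets.get(i);
        b.lo = Math.min(b.lo, v);   // only the first bucket can grow downward
        b.hi = Math.max(b.hi, v);   // only the last bucket can grow upward
        b.count++;
        long maxDepth = Math.max(2, 2 * total / maxBuckets);   // assumed split threshold
        if (b.count > maxDepth && b.lo < b.hi) {
            split(i);
            if (buckets.size() > maxBuckets) mergeSmallestPair();
        }
    }

    /** Index of the first bucket whose upper bound is >= v, else the last bucket. */
    private int locate(long v) {
        for (int i = 0; i < buckets.size(); i++) {
            if (v <= buckets.get(i).hi) return i;
        }
        return buckets.size() - 1;
    }

    /** Splits bucket i at its midpoint, dividing the count evenly (uniformity assumption). */
    private void split(int i) {
        Bucket b = buckets.get(i);
        long mid = b.lo + (b.hi - b.lo) / 2;
        long leftCount = b.count / 2;
        buckets.set(i, new Bucket(b.lo, mid, leftCount));
        buckets.add(i + 1, new Bucket(mid + 1, b.hi, b.count - leftCount));
    }

    /** Merges the adjacent pair of buckets with the smallest combined count. */
    private void mergeSmallestPair() {
        int best = 0;
        for (int i = 1; i + 1 < buckets.size(); i++) {
            if (pairCount(i) < pairCount(best)) best = i;
        }
        Bucket left = buckets.get(best);
        Bucket right = buckets.remove(best + 1);
        left.hi = right.hi;
        left.count += right.count;
    }

    private long pairCount(int i) {
        return buckets.get(i).count + buckets.get(i + 1).count;
    }

    List<Bucket> buckets() { return buckets; }

    public static void main(String[] args) {
        StreamingEquiDepthSketch h = new StreamingEquiDepthSketch(4);
        for (long v = 1; v <= 1000; v++) h.addValue((v * 37) % 1000);
        for (Bucket b : h.buckets()) {
            System.out.println("[" + b.lo + ", " + b.hi + "] ~" + b.count + " items");
        }
    }
}
```

Each call to addValue touches one bucket and performs at most one split and one merge, so the bucket boundaries stay available at any point in the stream.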



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-18 Thread Vincent Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481431#comment-16481431
 ] 

Vincent Poon commented on PHOENIX-4724:
---

Pushed to 4.x-cdh5.11 also



[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-18 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481371#comment-16481371
 ] 

James Taylor commented on PHOENIX-4724:
---

Please don't forget the 4.x-cdh5.11 branch, [~vincentpoon]. Note that's the 
only cdh branch you should commit to.



[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481155#comment-16481155
 ] 

Hudson commented on PHOENIX-4724:
-

FAILURE: Integrated in Jenkins build Phoenix-4.x-HBase-1.3 #137 (See 
[https://builds.apache.org/job/Phoenix-4.x-HBase-1.3/137/])
PHOENIX-4724 Efficient Equi-Depth histogram for streaming data (vincentpoon: 
rev 5935edd71873f9ec766ffe35000e96d2e48d)
* (add) 
phoenix-core/src/test/java/org/apache/phoenix/util/EquiDepthStreamHistogramTest.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/util/EquiDepthStreamHistogram.java




[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481096#comment-16481096
 ] 

Hudson commented on PHOENIX-4724:
-

SUCCESS: Integrated in Jenkins build Phoenix-4.x-HBase-0.98 #1897 (See 
[https://builds.apache.org/job/Phoenix-4.x-HBase-0.98/1897/])
PHOENIX-4724 Efficient Equi-Depth histogram for streaming data (vincentpoon: 
rev 865eb9a5362a0273cb85f6370b4470f03102a05a)
* (add) 
phoenix-core/src/test/java/org/apache/phoenix/util/EquiDepthStreamHistogramTest.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/util/EquiDepthStreamHistogram.java




[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-17 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480191#comment-16480191
 ] 

James Taylor commented on PHOENIX-4724:
---

+1. Excellent work, [~vincentpoon] !

> Efficient Equi-Depth histogram for streaming data
> -
>
> Key: PHOENIX-4724
> URL: https://issues.apache.org/jira/browse/PHOENIX-4724
> Project: Phoenix
> Issue Type: Sub-task
> Affects Versions: 4.15.0
> Reporter: Vincent Poon
> Assignee: Vincent Poon
> Priority: Major
> Attachments: PHOENIX-4724.v1.patch, PHOENIX-4724.v2.patch


[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-09 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469198#comment-16469198
 ] 

Maryann Xue commented on PHOENIX-4724:
--

Yes, I agree with [~jamestaylor] that this information can be useful for the 
query optimizer. Right now, for WHERE clause conditions other than filters on 
the primary key, we can only make a very rough "guess" at the number of 
rows/bytes of the filtered output. This information can definitely give a more 
accurate estimate for filter conditions on columns covered by the histogram. 
For example, for a range or equality condition on such columns, we can estimate 
the fraction of rows/bytes that pass the filter by calculating (number of 
buckets that fall in the range / number of total buckets).
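
A minimal sketch of that estimate, assuming the histogram is available as an ascending array of bucket boundaries. The class and method names are hypothetical, numeric boundaries stand in for real column values, and the fractional credit for partially covered buckets is an added assumption (uniform spread within a bucket), not something stated above.

```java
/** Illustrative only: selectivity of a range filter from equi-depth bucket boundaries. */
final class HistogramSelectivitySketch {
    /**
     * bounds holds n+1 ascending boundaries; bucket i covers [bounds[i], bounds[i+1]).
     * Returns the estimated fraction of rows with lo <= value < hi.
     */
    static double estimateSelectivity(double[] bounds, double lo, double hi) {
        int n = bounds.length - 1;
        if (n <= 0 || hi <= lo) return 0.0;
        double coveredBuckets = 0.0;
        for (int i = 0; i < n; i++) {
            double width = bounds[i + 1] - bounds[i];
            double overlap = Math.min(hi, bounds[i + 1]) - Math.max(lo, bounds[i]);
            // A fully covered bucket contributes 1; a partially covered one contributes a fraction.
            if (width > 0 && overlap > 0) coveredBuckets += overlap / width;
        }
        // Every bucket holds about the same number of rows, so buckets covered / total buckets.
        return coveredBuckets / n;
    }

    /** Scales the selectivity by a known total row count. */
    static long estimateRows(double[] bounds, double lo, double hi, long totalRows) {
        return Math.round(estimateSelectivity(bounds, lo, hi) * totalRows);
    }

    public static void main(String[] args) {
        double[] bounds = {0, 10, 20, 40, 100};   // four buckets of equal depth
        System.out.println(estimateSelectivity(bounds, 10, 40));
        System.out.println(estimateRows(bounds, 10, 40, 1_000_000L));
    }
}
```

Because the buckets are equi-depth, narrow buckets mark dense regions of the column, so the same width of range can yield very different estimates depending on where it falls.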

> Efficient Equi-Depth histogram for streaming data
> -
>
> Key: PHOENIX-4724
> URL: https://issues.apache.org/jira/browse/PHOENIX-4724
> Project: Phoenix
> Issue Type: Sub-task
> Affects Versions: 4.15.0
> Reporter: Vincent Poon
> Assignee: Vincent Poon
> Priority: Major
> Attachments: PHOENIX-4724.v1.patch, PHOENIX-4724.v2.patch


[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-07 Thread Vincent Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466452#comment-16466452
 ] 

Vincent Poon commented on PHOENIX-4724:
---

[~jamestaylor] I wrote this myself. I forgot to add the Apache license - will 
do that in the next revision.

Current use case is the parent Jira PHOENIX-4704, for pre-splitting an index 
table.  In that Jira I plan to scan or sample the data table, generating the 
index rowkey values and feeding them into this histogram.  Then afterwards I 
can use the histogram bounds to create the index table with the proper splits.  
I'm thinking this will be done in the IndexTool, though we could possibly also 
put it in createTableInternal somewhere as an option.
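
As a rough illustration of the last step, turning histogram bounds into split points could look like the sketch below. It assumes each bucket reports an upper bound and that the bounds arrive in ascending order. The class name is hypothetical, and String keys stand in for the index rowkey bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: derive table split points from equi-depth bucket upper bounds. */
final class SplitPointSketch {
    /**
     * n buckets need n-1 split points, so the last upper bound is dropped.
     * Repeated bounds (possible with heavily duplicated keys) are collapsed so the
     * result is strictly ascending.
     */
    static List<String> splitPoints(List<String> bucketUpperBounds) {
        List<String> splits = new ArrayList<>();
        for (int i = 0; i + 1 < bucketUpperBounds.size(); i++) {
            String key = bucketUpperBounds.get(i);
            if (splits.isEmpty() || splits.get(splits.size() - 1).compareTo(key) < 0) {
                splits.add(key);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> bounds = Arrays.asList("d", "k", "k", "r", "z");
        System.out.println(splitPoints(bounds));
    }
}
```

The resulting keys would then be handed to whatever creates the index table, so that each region starts out with roughly the same number of rows.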

In the future we could also add a table option to create this histogram at 
compaction time, and maintain it in memory.  There's still work to be done:
 * I haven't investigated updates/deletes yet, which [~aertoria] also inquired 
about.  Right now it only supports adding values, and can't distinguish updates 
from inserts (I think to do that we would need a count-min sketch or counting 
bloom filter implementation).
 * We need to add functionality to merge multiple histograms (e.g. from 
multiple different regions).

> Efficient Equi-Depth histogram for streaming data
> -
>
> Key: PHOENIX-4724
> URL: https://issues.apache.org/jira/browse/PHOENIX-4724
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Vincent Poon
> Assignee: Vincent Poon
> Priority: Major
> Attachments: PHOENIX-4724.v1.patch


[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-07 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466376#comment-16466376
 ] 

James Taylor commented on PHOENIX-4724:
---

Great stuff, [~vincentpoon]. A couple of questions:
 * Does the histogram building take into account when a value is being 
overwritten again and again (since at write time we don't know if we're 
overwriting or not, unless the table is declared immutable)? Same question on 
delete of a column value. 
 * How would you envision this being integrated with Phoenix? Maybe we could 
generate it when we update stats: at table compaction time and when UPDATE 
STATISTICS is called? Or, depending on the answers to the above, would we try 
to constantly update the histogram as the data is mutating?
 * This information would definitely be useful for the optimizer. We have 
PHOENIX-1178, but not much detail there. [~maryannxue] could likely fill in how 
this information could be used. One big area is in estimating how much data 
will be filtered with a WHERE clause. Another is in how distinct a column is.
 * Did you write EquiDepthStreamHistogram.java or did you find it somewhere 
online? If the latter, how was it licensed?



[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-07 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466265#comment-16466265
 ] 

Ethan Wang commented on PHOENIX-4724:
-

[~xucang]

Correct. But the distribution info is used at the moment when splitting 
happens, if I'm not mistaken.



[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-07 Thread Xu Cang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466121#comment-16466121
 ] 

Xu Cang commented on PHOENIX-4724:
--

[~aertoria]

I am not speaking for Vincent, but my understanding is that this method will be 
used when a user wants to build an index. This is a one-time effort based on 
the current state of the table (or you can call it a snapshot). So no use case 
requires removeValue() in this index-building scenario.



[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-04 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464280#comment-16464280
 ] 

Ethan Wang commented on PHOENIX-4724:
-

[~vincentpoon] I see. What happens when the data table is mutated? What's the 
strategy today for the index table to sync up?

With that in mind, in class EquiDepthStreamHistogram, besides addValue(), does 
this algorithm support removeValue() as well?



[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-03 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463382#comment-16463382
 ] 

Ethan Wang commented on PHOENIX-4724:
-

[~vincentpoon]

If I understand correctly, with this feature implemented, when you build an 
index table, you will at the same time record some info into this histogram, so 
that at some point in the future you can conveniently get the distribution info 
of the index table. Correct?

So do you store a histogram object for each index table, like a shadow object 
somewhere offline? Also, will there ever be a case where you need to mutate or 
remove an index entry from an existing index table?

Cool idea!



[jira] [Commented] (PHOENIX-4724) Efficient Equi-Depth histogram for streaming data

2018-05-03 Thread Vincent Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463240#comment-16463240
 ] 

Vincent Poon commented on PHOENIX-4724:
---

[~aertoria] check it out
