GitHub user wzhfy opened a pull request:

    [SPARK-17997] [SQL] Add an aggregation function for counting distinct 
values for multiple intervals

    ## What changes were proposed in this pull request?
    This work is a part of SPARK-17074 to generate histogram statistics.
    This work computes the number of distinct values (ndv) for each bin of an 
equi-height histogram. A bin consists of two endpoints, which form an interval 
of values, and the ndv in that interval. To compute histogram statistics, once 
the endpoints are known, we need an aggregate function that counts the distinct 
values in each interval.
    This PR also refactors HyperLogLogPlusPlus by extracting a helper class, 
HyperLogLogPlusPlusAlgo, which holds the actual algorithm.
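
As a rough illustration of what such an aggregate computes (not the code in this patch), here is a minimal Python sketch that counts exact distinct values per interval using plain sets. The function name `count_distinct_per_interval`, the sample data, and the interval convention (first interval closed on both ends, later intervals open on the left and closed on the right) are assumptions made for this example. The patch appears to build on HyperLogLog++ for approximate counting, whereas this sketch swaps in exact sets to keep the idea visible.

```python
from bisect import bisect_left


def count_distinct_per_interval(values, endpoints):
    """Return the exact ndv for each interval defined by sorted endpoints.

    N + 1 endpoints define N intervals. Assumed convention (hypothetical,
    chosen for this sketch): [e0, e1], (e1, e2], ..., (e_{N-1}, e_N].
    """
    if len(endpoints) < 2:
        raise ValueError("need at least two endpoints")
    seen = [set() for _ in range(len(endpoints) - 1)]
    lo, hi = endpoints[0], endpoints[-1]
    for v in values:
        # Skip nulls and values outside the overall range.
        if v is None or v < lo or v > hi:
            continue
        # bisect_left returns the first index i with endpoints[i] >= v,
        # so v belongs to the interval ending at endpoints[i].
        idx = bisect_left(endpoints, v)
        # v == lo yields idx 0; fold it into the first interval.
        seen[max(idx - 1, 0)].add(v)
    return [len(s) for s in seen]


# Illustrative data only: three intervals [1, 3], (3, 8], (8, 10].
ndvs = count_distinct_per_interval(
    [1, 1, 2, 3, 3, 3, 5, 8, 8, 10, 42], [1, 3, 8, 10]
)
```

A single pass over the data fills all intervals at once, which is the reason for having one aggregate over multiple intervals instead of one distinct count per bin.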
    ## How was this patch tested?
    Added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull countIntervals

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15544

commit 0255f6cbf8f32bba223c479b27b35a9310e52658
Author: wangzhenhua <>
Date:   2016-10-14T06:23:39Z

    refactor hllpp

commit ebeb0349e1786b6d74706bbf33a335c32a6eda7d
Author: wangzhenhua <>
Date:   2016-10-17T13:18:36Z

    add IntervalDistinctApprox

commit e274ef22b96ca878df326675e28566cfff6b5088
Author: wangzhenhua <>
Date:   2016-10-19T01:58:32Z

    add test cases

