GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19479

    [SPARK-17074] [SQL] Generate equi-height histogram in column statistics

    ## What changes were proposed in this pull request?
    
    An equi-height histogram is effective for cardinality estimation, and is more 
accurate than basic column stats (min, max, ndv, etc.), especially for skewed 
distributions. So we need to support it.
    
    In an equi-height histogram, all buckets (intervals) have the same height, 
i.e. each bucket holds roughly the same number of rows.
    In this PR, we use a two-step method to generate an equi-height histogram:
    1. Use `ApproximatePercentile` to get the percentiles `p(1/n), p(2/n) ... 
p((n-1)/n)`, where `n` is the number of buckets;
    2. Use min, max, and the percentiles to construct the bucket ranges, e.g. 
`[min, p(1/n)], [p(1/n), p(2/n)] ... [p((n-1)/n), max]`, and then use 
`ApproxCountDistinctForIntervals` to count the number of distinct values (ndv) 
in each bucket. Each bucket is of the form: `(lowerBound, higherBound, ndv)`.
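
The two steps above can be sketched in plain Python. This is only an illustration, not the code in this PR: it swaps in exact sorting and exact set-based counting for `ApproximatePercentile` and `ApproxCountDistinctForIntervals`, and the function name `equi_height_histogram`, the nearest-rank percentile rule, and the choice to count a boundary value only in the lower bucket are all assumptions made for the example.

```python
import math


def equi_height_histogram(values, num_buckets):
    """Return a list of (lower_bound, higher_bound, ndv) buckets.

    Hypothetical helper: exact stand-in for the approximate
    aggregates named in the PR description.
    """
    if not values or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")

    ordered = sorted(values)
    n = len(ordered)

    # Step 1: percentiles p(1/b), p(2/b) ... p((b-1)/b) for b buckets,
    # using the nearest-rank rule (an assumption of this sketch).
    percentiles = [
        ordered[max(math.ceil(n * k / num_buckets) - 1, 0)]
        for k in range(1, num_buckets)
    ]

    # Step 2: min, the percentiles, and max are the bucket endpoints.
    endpoints = [ordered[0]] + percentiles + [ordered[-1]]

    buckets = []
    for i in range(num_buckets):
        lo, hi = endpoints[i], endpoints[i + 1]
        if i == 0:
            # The first bucket includes its lower bound (the minimum).
            distinct = {v for v in ordered if lo <= v <= hi}
        else:
            # Later buckets exclude the lower bound so that a boundary
            # value is not counted twice (assumption of this sketch).
            distinct = {v for v in ordered if lo < v <= hi}
        buckets.append((lo, hi, len(distinct)))
    return buckets


print(equi_height_histogram(list(range(1, 9)), 4))
# prints [(1, 2, 2), (2, 4, 2), (4, 6, 2), (6, 8, 2)]
```

With skewed input such as `[1, 1, 1, 1, 2, 3, 4, 5]`, several endpoints collapse onto the frequent value, which is how the histogram exposes skew that min, max, and a single ndv cannot.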
    
    ## How was this patch tested?
    
    Added new test cases and modified some existing test cases.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark generate_histogram

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19479.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19479
    
----
commit 54c678cd4903e0a8036fca57ed31712402f6d71e
Author: Zhenhua Wang <[email protected]>
Date:   2017-10-11T06:28:00Z

    generate equi-height histogram

commit 31a852affc7f359dae01e6a893cffec4caf1235f
Author: Zhenhua Wang <[email protected]>
Date:   2017-10-12T06:35:16Z

    add/modify tests

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
