GitHub user wzhfy opened a pull request:
https://github.com/apache/spark/pull/19479
[SPARK-17074] [SQL] Generate equi-height histogram in column statistics
## What changes were proposed in this pull request?
An equi-height histogram is effective for cardinality estimation and is more
accurate than basic column statistics (min, max, ndv, etc.), especially for
skewed distributions, so we need to support it.
In an equi-height histogram, all buckets (intervals) have the same height,
i.e. each bucket covers approximately the same number of rows.
In this PR, we use a two-step method to generate an equi-height histogram:
1. Use `ApproximatePercentile` to get the percentiles
`p(1/n), p(2/n) ... p((n-1)/n)`;
2. Use min, max, and the percentiles to construct the bucket ranges, e.g.
`[min, p(1/n)], [p(1/n), p(2/n)] ... [p((n-1)/n), max]`, and then use
`ApproxCountDistinctForIntervals` to count the ndv in each bucket. Each bucket
is of the form: `(lowerBound, higherBound, ndv)`.
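
The two steps above can be sketched in plain Python. This is an illustration only, not the code in this PR: it uses exact nearest-rank percentiles and exact distinct counts as stand-ins for `ApproximatePercentile` and `ApproxCountDistinctForIntervals`. The names `Bucket` and `equi_height_histogram` are hypothetical. Counting a value that sits on a shared boundary in the lower bucket only is a convention assumed for this sketch.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple


class Bucket(NamedTuple):
    lower: float
    upper: float
    ndv: int  # number of distinct values counted in this bucket


def equi_height_histogram(values: List[float], num_buckets: int) -> List[Bucket]:
    """Build an equi-height histogram with exact (not approximate) statistics."""
    if not values or num_buckets < 1:
        raise ValueError("need at least one value and one bucket")
    data = sorted(values)
    n = len(data)

    # Step 1: percentiles p(1/n) ... p((n-1)/n), using the nearest-rank rule.
    # Integer ceiling division avoids floating-point rounding of the rank.
    percentiles = []
    for i in range(1, num_buckets):
        rank = max(1, -(-i * n // num_buckets))  # ceil(i * n / num_buckets)
        percentiles.append(data[rank - 1])

    # Step 2: min, percentiles, and max form the bucket endpoints.
    endpoints = [data[0]] + percentiles + [data[-1]]
    distinct = sorted(set(data))

    buckets = []
    for i in range(num_buckets):
        lo, hi = endpoints[i], endpoints[i + 1]
        # Assumed convention: the first bucket is [lo, hi], later ones are (lo, hi],
        # so a value on a shared boundary is counted once, in the lower bucket.
        start = bisect_left(distinct, lo) if i == 0 else bisect_right(distinct, lo)
        end = bisect_right(distinct, hi)
        buckets.append(Bucket(lo, hi, end - start))
    return buckets
```

For example, `equi_height_histogram([1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9], 4)` yields the buckets `(1, 1, 1)`, `(1, 3, 2)`, `(3, 6, 3)`, `(6, 9, 3)`. The skewed value `1` collapses the first bucket to a single point with ndv 1, which is the kind of information min, max, and a global ndv alone cannot capture.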
## How was this patch tested?
Added new test cases and modified some existing test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wzhfy/spark generate_histogram
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19479.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19479
----
commit 54c678cd4903e0a8036fca57ed31712402f6d71e
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-11T06:28:00Z
generate equi-height histogram
commit 31a852affc7f359dae01e6a893cffec4caf1235f
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-12T06:35:16Z
add/modify tests
----