[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhenhua Wang updated SPARK-17074:
---------------------------------
Description:
An equi-height histogram is effective in handling skewed data distributions.
In an equi-height histogram, all bins (intervals) have the same height. The
default number of bins we use is 254.
We now use a two-step method to generate an equi-height histogram:
1. use percentile_approx to get the percentiles (the endpoints of the
equi-height bin intervals);
2. use a new aggregate function to count the ndv (number of distinct values)
in each of these bins.
Note that this method takes two table scans. We may provide other algorithms
that take only one table scan in the future.
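
The two-step method above can be illustrated with a minimal sketch in plain
Python. This is not the Spark implementation: exact nearest-rank percentiles
stand in for percentile_approx, a per-bin set stands in for the new ndv
aggregate, and the function names (equi_height_endpoints, ndv_per_bin) are
hypothetical. It assumes a non-empty list of comparable values.

```python
from bisect import bisect_left


def equi_height_endpoints(values, num_bins):
    """Pass 1: bin endpoints taken at evenly spaced percentiles.

    Exact nearest-rank percentiles are used here as a stand-in for
    percentile_approx (an assumption made for illustration only).
    """
    ordered = sorted(values)
    n = len(ordered)
    endpoints = []
    for i in range(num_bins + 1):
        # Integer ceiling of i * n / num_bins, clamped so rank 0 maps to the minimum.
        rank = max(1, -(-(i * n) // num_bins))
        endpoints.append(ordered[rank - 1])
    return endpoints


def ndv_per_bin(values, endpoints):
    """Pass 2: count the distinct values that fall into each bin.

    Bin i covers (endpoints[i], endpoints[i + 1]]; the first bin also
    includes its lower endpoint.
    """
    num_bins = len(endpoints) - 1
    distinct = [set() for _ in range(num_bins)]
    for v in values:
        idx = bisect_left(endpoints, v) - 1
        idx = min(max(idx, 0), num_bins - 1)
        distinct[idx].add(v)
    return [len(s) for s in distinct]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    ends = equi_height_endpoints(data, 4)
    print(ends, ndv_per_bin(data, ends))
```

Each function iterates over the data once, which mirrors the two table scans
noted above. With heavily skewed data, adjacent endpoints can coincide, in
which case this sketch leaves some bins empty.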
was:
An equi-height histogram is effective in handling skewed data distributions.
In an equi-height histogram, all bins (intervals) have the same height. The
default number of bins we use is 254.
We first use [SPARK-18000] to compute equi-width histograms (for both numeric
and string types) or endpoints of equi-height histograms (for numeric type
only). Then, if we get endpoints of an equi-height histogram, we need to compute
the ndv's between those endpoints by [SPARK-17997] to form the equi-height
histogram.
This Jira incorporates three Jiras mentioned above to support needed
aggregation functions. We need to resolve them before this one.
> generate equi-height histogram for column
> -----------------------------------------
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
> Issue Type: Sub-task
> Components: Optimizer
> Affects Versions: 2.0.0
> Reporter: Ron Hu
>
> An equi-height histogram is effective in handling skewed data distributions.
> In an equi-height histogram, all bins (intervals) have the same height.
> The default number of bins we use is 254.
> We now use a two-step method to generate an equi-height histogram:
> 1. use percentile_approx to get the percentiles (the endpoints of the
> equi-height bin intervals);
> 2. use a new aggregate function to count the ndv (number of distinct values)
> in each of these bins.
> Note that this method takes two table scans. We may provide other algorithms
> that take only one table scan in the future.