[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17074:
---------------------------------
    Description: 
An equi-height histogram is effective for handling skewed data distributions.

In an equi-height histogram, all bins (intervals) have the same height, that 
is, each bin holds roughly the same number of rows. The default number of bins 
we use is 254.

We currently use a two-step method to generate an equi-height histogram:
1. use percentile_approx to get percentiles (the end points of the equi-height 
bin intervals);
2. use a new aggregate function to count the number of distinct values (ndv) 
in each of these bins.

Note that this method requires two table scans. In the future we may provide 
other algorithms that need only one table scan.
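
The two steps can be sketched in plain Python, for illustration only. This is 
not the Spark implementation: exact percentiles taken from a sorted list stand 
in for percentile_approx, a per-bin set stands in for the new aggregate 
function, and the names equi_height_endpoints and ndv_per_bin are hypothetical.

```python
from bisect import bisect_left


def equi_height_endpoints(values, num_bins):
    """Step 1 (first scan): bin end points from exact percentiles.

    Assumption: sorting stands in for percentile_approx.
    """
    if not values or num_bins < 1:
        raise ValueError("need at least one value and one bin")
    ordered = sorted(values)
    n = len(ordered)
    endpoints = [ordered[0]]  # lower bound of the first bin
    for i in range(1, num_bins + 1):
        # Nearest-rank percentile: the ceil(i * n / num_bins)-th smallest value.
        rank = (i * n + num_bins - 1) // num_bins
        endpoints.append(ordered[rank - 1])
    return endpoints


def ndv_per_bin(values, endpoints):
    """Step 2 (second scan): count distinct values in each bin.

    Bin i covers (endpoints[i], endpoints[i + 1]]; the first bin also
    includes its lower end point. Values must lie within the end points.
    """
    distinct = [set() for _ in range(len(endpoints) - 1)]
    for v in values:
        # The first upper end point that is >= v decides the bin.
        upper = bisect_left(endpoints, v, 1)
        distinct[upper - 1].add(v)
    return [len(s) for s in distinct]


values = [1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9]
endpoints = equi_height_endpoints(values, num_bins=3)
print(endpoints)                       # [1, 2, 5, 9]
print(ndv_per_bin(values, endpoints))  # [2, 3, 4]
```

In this example each of the three bins holds four rows, while the ndv per bin 
differs. Heavily duplicated values can produce repeated end points, which this 
sketch does not treat specially.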

  was:
Equi-height histogram is effective in handling skewed data distribution.

For equi-height histogram, the heights of all bins(intervals) are the same. The 
default number of bins we use is 254.

Now we use a two-step method to generate an equi-height histogram:
1. use percentile_approx to get percentiles (end points of the equi-height bin 
intervals);
2. use a new aggregate function to count ndv in each of these bins.

Note that this method takes two table scans. We may provide other algorithms 
which takes only one table scan in the future.


> generate equi-height histogram for column
> -----------------------------------------
>
>                 Key: SPARK-17074
>                 URL: https://issues.apache.org/jira/browse/SPARK-17074
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer
>    Affects Versions: 2.0.0
>            Reporter: Ron Hu
>
> An equi-height histogram is effective for handling skewed data distributions.
> In an equi-height histogram, all bins (intervals) have the same height, that 
> is, each bin holds roughly the same number of rows. The default number of 
> bins we use is 254.
> We currently use a two-step method to generate an equi-height histogram:
> 1. use percentile_approx to get percentiles (the end points of the 
> equi-height bin intervals);
> 2. use a new aggregate function to count the number of distinct values (ndv) 
> in each of these bins.
> Note that this method requires two table scans. In the future we may provide 
> other algorithms that need only one table scan.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
