[ https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-17074.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19479
[https://github.com/apache/spark/pull/19479]

> generate equi-height histogram for column
> -----------------------------------------
>
>                 Key: SPARK-17074
>                 URL: https://issues.apache.org/jira/browse/SPARK-17074
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Optimizer
>    Affects Versions: 2.3.0
>            Reporter: Ron Hu
>             Fix For: 2.3.0
>
>
> Equi-height histogram is effective in handling skewed data distribution.
> For equi-height histogram, the heights of all bins (intervals) are the same.
> The default number of bins we use is 254.
> Now we use a two-step method to generate an equi-height histogram:
> 1. use percentile_approx to get percentiles (end points of the equi-height
> bin intervals);
> 2. use a new aggregate function to get distinct counts in each of these bins.
> Note that this method takes two table scans. In the future we may provide
> other algorithms which need only one table scan.
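
The two-step method described in the issue can be sketched in plain Python as follows. This is an illustration only, not the code from pull request 19479: it swaps percentile_approx for exact nearest-rank percentiles on a sorted list, and the function names (equi_height_endpoints, distinct_counts_per_bin) and the bin boundary convention (first bin closed on both sides, later bins open on the left) are assumptions made for the example.

```python
import bisect
import math


def equi_height_endpoints(values, num_bins):
    """Step 1 (first scan): return num_bins + 1 bin endpoints.

    Assumption: exact nearest-rank percentiles stand in for the
    approximate percentiles that percentile_approx would return.
    """
    ordered = sorted(values)
    n = len(ordered)
    endpoints = [ordered[0]]  # lower end of the first bin is the minimum
    for i in range(1, num_bins + 1):
        # Smallest value such that at least i/num_bins of the rows are <= it.
        rank = math.ceil(i * n / num_bins)
        endpoints.append(ordered[rank - 1])
    return endpoints


def distinct_counts_per_bin(values, endpoints):
    """Step 2 (second scan): count distinct values falling in each bin.

    Assumed convention: bin 0 is [e0, e1], bin i > 0 is (e_i, e_{i+1}].
    """
    seen = [set() for _ in range(len(endpoints) - 1)]
    for v in values:
        # Index of the first endpoint >= v; the bin is the one ending there.
        j = bisect.bisect_left(endpoints, v)
        seen[max(j - 1, 0)].add(v)
    return [len(s) for s in seen]


if __name__ == "__main__":
    data = [1, 1, 1, 1, 2, 3, 4, 10]  # skewed toward the value 1
    ends = equi_height_endpoints(data, num_bins=4)
    print(ends)
    print(distinct_counts_per_bin(data, ends))
```

With skewed input, several consecutive endpoints can be equal, which is how an equi-height histogram reflects a heavily repeated value. The sketch reads the data twice, once per step, matching the two table scans mentioned in the issue.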