[ https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067167#comment-16067167 ]
Andrew Ray commented on SPARK-21184: ------------------------------------ Also the lookup queries are just wrong {code} scala> Seq(1, 2).toDF("a").selectExpr("percentile_approx(a, 0.001)").head res9: org.apache.spark.sql.Row = [2.0] {code} > QuantileSummaries implementation is wrong and QuantileSummariesSuite fails > with larger n > ---------------------------------------------------------------------------------------- > > Key: SPARK-21184 > URL: https://issues.apache.org/jira/browse/SPARK-21184 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.1 > Reporter: Andrew Ray > > 1. QuantileSummaries implementation does not match the paper it is supposed > to be based on. > 1a. The compress method > (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240) > merges neighboring buckets, but thats not what the paper says to do. The > paper > (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) > describes an implicit tree structure and the compress method deletes selected > subtrees. > 1b. The paper does not discuss merging these summary data structures at all. > The following comment is in the merge method of QuantileSummaries: > {quote} // The GK algorithm is a bit unclear about it, but it seems > there is no need to adjust the > // statistics during the merging: the invariants are still respected > after the merge.{quote} > Unless I'm missing something that needs substantiation, it's not clear that > that the invariants hold. > 2. QuantileSummariesSuite fails with n = 10000 (and other non trivial values) > https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27 > One possible solution if these issues can't be resolved would be to move to > an algorithm that explicitly supports merging and is well tested like > https://github.com/tdunning/t-digest -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org