[ 
https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067167#comment-16067167
 ] 

Andrew Ray commented on SPARK-21184:
------------------------------------

Also the lookup queries are just wrong

{code}
scala> Seq(1, 2).toDF("a").selectExpr("percentile_approx(a, 0.001)").head
res9: org.apache.spark.sql.Row = [2.0]
{code}


> QuantileSummaries implementation is wrong and QuantileSummariesSuite fails 
> with larger n
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-21184
>                 URL: https://issues.apache.org/jira/browse/SPARK-21184
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Andrew Ray
>
> 1. QuantileSummaries implementation does not match the paper it is supposed 
> to be based on.
> 1a. The compress method 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
>  merges neighboring buckets, but thats not what the paper says to do. The 
> paper 
> (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) 
> describes an implicit tree structure and the compress method deletes selected 
> subtrees.
> 1b. The paper does not discuss merging these summary data structures at all. 
> The following comment is in the merge method of QuantileSummaries:
> {quote}      // The GK algorithm is a bit unclear about it, but it seems 
> there is no need to adjust the
>       // statistics during the merging: the invariants are still respected 
> after the merge.{quote}
> Unless I'm missing something that needs substantiation, it's not clear that 
> that the invariants hold.
> 2. QuantileSummariesSuite fails with n = 10000 (and other non trivial values)
> https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27
> One possible solution if these issues can't be resolved would be to move to 
> an algorithm that explicitly supports merging and is well tested like 
> https://github.com/tdunning/t-digest



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to