[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n

Timothy Hunter (JIRA) Thu, 31 Aug 2017 13:00:45 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149510#comment-16149510
 ]


Timothy Hunter commented on SPARK-21184:
----------------------------------------

[~a1ray] thank you for the report, someone should investigate about these given 
values.

You raise some valid questions about the choice of data structures and 
algorithm, which were discussed during the implementation and that can 
certainly be revisited:

- tree structures: the major constraint here is that this structure gets 
serialized often, due to how UDAFs work. This is why the current implementation 
is amortized over multiple records. Edo Liberty has published some recent work 
that is relevant in that area.

- algorithm: we looked at t-digest (and q-digest). The main concern back then 
was that there was no published worst-time guarantee given a target precision. 
This is still the case to my knowledge. Because of that, it is hard to 
understand what could happen in some unusual cases - which tend to be not so 
unusual in big data. That being said, it looks like it is a popular and 
well-maintained choice now, so I am certainly open to relaxing this constraint.

> QuantileSummaries implementation is wrong and QuantileSummariesSuite fails 
> with larger n
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-21184
>                 URL: https://issues.apache.org/jira/browse/SPARK-21184
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Andrew Ray
>
> 1. QuantileSummaries implementation does not match the paper it is supposed 
> to be based on.
> 1a. The compress method 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
>  merges neighboring buckets, but thats not what the paper says to do. The 
> paper 
> (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) 
> describes an implicit tree structure and the compress method deletes selected 
> subtrees.
> 1b. The paper does not discuss merging these summary data structures at all. 
> The following comment is in the merge method of QuantileSummaries:
> {quote}      // The GK algorithm is a bit unclear about it, but it seems 
> there is no need to adjust the
>       // statistics during the merging: the invariants are still respected 
> after the merge.{quote}
> Unless I'm missing something that needs substantiation, it's not clear that 
> that the invariants hold.
> 2. QuantileSummariesSuite fails with n = 10000 (and other non trivial values)
> https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27
> One possible solution if these issues can't be resolved would be to move to 
> an algorithm that explicitly supports merging and is well tested like 
> https://github.com/tdunning/t-digest



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n

Reply via email to