[ 
https://issues.apache.org/jira/browse/SPARK-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15610192#comment-15610192
 ] 

Zhenhua Wang edited comment on SPARK-18111 at 10/27/16 12:46 AM:
-----------------------------------------------------------------

[~srowen] The minimum is skipped not just once over the whole data set, but 
once per partition.
For example, take two partitions of data, (1, 1, 3, 3) and (5, 5, 7, 7). After 
the global merge, the samples in QuantileSummaries are (1, 3, 3, 5, 7, 7), and 
the query percentile_approx(0.25, 0.5, 0.75) returns (3.0, 5.0, 7.0), whereas 
the correct answer is (1.0, 3.0, 5.0). We could say that it's an approximate 
algorithm, but this error is already *beyond the error bound that the algorithm 
provides*. Moreover, we can make the error even larger by constructing more 
such partitions, since each one adds another skipped minimum element.
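
To make the numbers above easy to check, here is a minimal Python sketch. It is 
not Spark's QuantileSummaries code. It assumes a nearest-rank definition of the 
exact percentile (the value at rank ceil(p * n)), which reproduces the "correct" 
answers quoted above. The helper names exact_percentile and rank_error are 
hypothetical. The last step is only a toy model of the skipped minimum: it looks 
up the full-data ranks in the shortened sample list, which happens to yield the 
reported result.

```python
import math

def exact_percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p * n), 1-based."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return float(ordered[rank - 1])

def rank_error(values, p, reported):
    """Distance from the target rank to the closest rank holding `reported`.

    Assumes `reported` occurs in `values`.
    """
    ordered = sorted(values)
    target = max(1, math.ceil(p * len(ordered)))
    ranks = [i + 1 for i, v in enumerate(ordered) if v == reported]
    return min(abs(r - target) for r in ranks)

partitions = [(1, 1, 3, 3), (5, 5, 7, 7)]
data = [v for part in partitions for v in part]
percentages = (0.25, 0.5, 0.75)
reported = (3.0, 5.0, 7.0)           # result quoted in the comment
merged_samples = [1, 3, 3, 5, 7, 7]  # samples quoted in the comment

exact = [exact_percentile(data, p) for p in percentages]
errors = [rank_error(data, p, r) / len(data)
          for p, r in zip(percentages, reported)]

# Toy model (an assumption, not Spark's query logic): ranks computed against
# the full count of 8 are looked up in the sample list that lost one minimum
# per partition.
shifted = [float(merged_samples[math.ceil(p * len(data)) - 1])
           for p in percentages]

print(exact)    # [1.0, 3.0, 5.0]
print(errors)   # [0.125, 0.125, 0.125]
print(shifted)  # [3.0, 5.0, 7.0]
```

Each reported value is off by one rank out of eight, a relative rank error of 
0.125. If 1.0/accuracy is taken as the relative error bound (my understanding 
of the accuracy argument, shown as 10000 in the query output quoted below), the 
bound would be 0.0001. With the same helper, exact_percentile([1, 1, 2, 2], 0.5) 
gives 1.0, which matches the expected answer in the issue description.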



> Wrong ApproximatePercentile answer when multiple records have the minimum 
> value
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18111
>                 URL: https://issues.apache.org/jira/browse/SPARK-18111
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>            Reporter: Zhenhua Wang
>
> When multiple records have the minimum value, ApproximatePercentile returns 
> a wrong answer.
> For example, the following query returns 2.0 for percentile 0.5, but the 
> correct answer is 1.0:
> 0: jdbc:hive2://localhost:10000> select key from src2;
> +------+--+
> | key  |
> +------+--+
> | 1    |
> | 1    |
> | 2    |
> | 2    |
> +------+--+
> 4 rows selected (0.185 seconds)
> 0: jdbc:hive2://localhost:10000> select percentile_approx(key, array(0.5)) from src2;
> +------------------------------------------------------------+--+
> | percentile_approx(CAST(key AS DOUBLE), array(0.5), 10000)  |
> +------------------------------------------------------------+--+
> | [2.0]                                                      |
> +------------------------------------------------------------+--+
> 1 row selected (0.292 seconds)


