sitegui commented on issue #26029: [SPARK-29336][SQL] Fix the implementation of 
QuantileSummaries.merge (guarantee that the relativeError will be respected)
URL: https://github.com/apache/spark/pull/26029#issuecomment-538633749
 
 
   [This 
test](https://github.com/apache/spark/blob/8556710409d9f2fbaee9dbf76a2ea70218316693/sql/core/src/test/scala/org/apache/spark/sql/ApproximatePercentileQuerySuite.scala#L124-L142)
 breaks, but I guess it was not testing the right thing.
   
   It checks that, for one specific query, a higher accuracy yields a lower measured error. The issue is that the algorithm only guarantees that the maximum error over all queries decreases as the accuracy increases: it is bounded by `count / accuracy`. However, for a given query the measured error can be well below that maximum, so increasing the accuracy does not necessarily shrink it further.
   
   The test in question runs a single query against a dataset of 1000 elements, with accuracies 1, 10, 100, 1000 and 10000. Before the patch the measured errors were `249, 97, 9, 1, 0`; now they are `249, 40, 0, 1, 0`. Both sequences respect the maximum errors of `1000, 100, 10, 1, 0`; we just got lucky in the third case now, which breaks the expectation that the error keeps shrinking as the accuracy grows.
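   
   Just to make the arithmetic behind those maximum errors explicit, they are `count / accuracy` with integer division (plain Scala, nothing Spark-specific):
   
```scala
// Guaranteed maximum rank error for 1000 elements at each accuracy used by the
// test; integer division reproduces the 1000, 100, 10, 1, 0 figures quoted above.
val count = 1000
val maxErrors = Seq(1, 10, 100, 1000, 10000).map(accuracy => count / accuracy)
// maxErrors == List(1000, 100, 10, 1, 0)
```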
   
   I'll modify this test case to check instead that the decreasing maximum error bound is respected. If you have other suggestions, please let me know.
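   
   Roughly, the replacement check I have in mind would assert the bound directly at each accuracy. Below is only a spark-shell-style sketch: the table name `tbl`, the `1 to 1000` data, the 0.25 percentage and the expected value 250 are illustrative assumptions, not the suite's actual setup.
   
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("max-error-sketch").getOrCreate()
import spark.implicits._

val count = 1000
(1 to count).toDF("col").createOrReplaceTempView("tbl")
// Illustrative expected value: for the data 1..1000 the value error equals the
// rank error, so it can be compared directly against the count / accuracy bound.
val exact = 250.0

Seq(1, 10, 100, 1000, 10000).foreach { accuracy =>
  val approx = spark.sql(
    s"SELECT CAST(percentile_approx(col, 0.25, $accuracy) AS DOUBLE) FROM tbl")
    .head().getDouble(0)
  // Guaranteed maximum error at this accuracy (integer division, as above).
  val maxError = count / accuracy
  assert(math.abs(approx - exact) <= maxError,
    s"accuracy=$accuracy: |$approx - $exact| exceeds the bound $maxError")
}
```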
