GitHub user wzhfy opened a pull request:
https://github.com/apache/spark/pull/19438
[SPARK-22208] [SQL] Improve percentile_approx by not rounding up
targetError and starting from index 0
## What changes were proposed in this pull request?
Currently, percentile_approx never returns the first element when the percentile
is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the
total number of elements. Ideally, every percentile in [0, 1/N] should
return the first element.
For example, given the input data 1 to 10, a query for the 10% percentile (or any
smaller percentile) should return 1, because the first value, 1, already covers
10% of the data. Currently it returns 2.
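The expected result in this example can be sketched in Python. This only illustrates the semantics described above and is not Spark's implementation; the function name `expected_percentile` is hypothetical.

```python
def expected_percentile(values, percentile):
    """Return the smallest value whose cumulative share of the data reaches `percentile`."""
    ordered = sorted(values)
    n = len(ordered)
    for i, value in enumerate(ordered):
        # The first i + 1 elements cover (i + 1) / n of the data.
        if (i + 1) / n >= percentile:
            return value
    return ordered[-1]


data = list(range(1, 11))               # input data 1 to 10
print(expected_percentile(data, 0.10))  # 1: the first value already covers 10%
print(expected_percentile(data, 0.05))  # 1: anything in [0, 1/N] maps to the first element
print(expected_percentile(data, 0.20))  # 2
```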
According to the paper, targetError should not be rounded up, and the search
index should start from 0 instead of 1. Following the paper on both points
fixes the cases described above.
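A toy model can show why these two details matter for the 1-to-10 example. The sketch below is not taken from Spark's source. It assumes an uncompressed summary in which every input value is kept as a sample, so the sample at index i has rank i + 1, and it assumes a sample is accepted when its rank is within targetError of the target rank. The function `toy_query` and its parameters `round_up_target_error` and `start_index` are hypothetical names introduced only for illustration.

```python
import math


def toy_query(sorted_values, percentile, relative_error=1e-4,
              round_up_target_error=False, start_index=0):
    """Toy rank search over an uncompressed summary (assumption: one sample per value)."""
    n = len(sorted_values)
    rank = math.ceil(percentile * n)        # target rank
    target_error = relative_error * n
    if round_up_target_error:
        target_error = math.ceil(target_error)
    min_rank = 0
    for i in range(start_index, n):
        min_rank += 1                       # each visited sample adds one to the running rank
        # Accept the sample when its running rank is within target_error of the target rank.
        if min_rank - target_error <= rank <= min_rank + target_error:
            return sorted_values[i]
    return sorted_values[-1]


data = list(range(1, 11))
# Rounding targetError up and starting at index 1 skips the first element.
print(toy_query(data, 0.10, round_up_target_error=True, start_index=1))   # 2
# Keeping targetError unrounded and starting at index 0 reaches it.
print(toy_query(data, 0.10, round_up_target_error=False, start_index=0))  # 1
```

In this model, starting at index 1 means the first sample is never a candidate, and rounding 10 * 1/10000 up to 1 widens the accepted rank window by a full rank on each side.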
## How was this patch tested?
Added a new test case and fixed existing test cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wzhfy/spark improve_percentile_approx
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19438.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19438
----
commit 24f8295498a7ad6d2d99ea27a196ccf154165907
Author: Zhenhua Wang <[email protected]>
Date: 2017-09-30T16:04:32Z
return the first element for small percentage
commit 8c8c22dbebe99def6127b49988dfc4f886797bd6
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-02T10:24:28Z
fix test
commit dbc3d47b0a56113032d2a4565180932e4ef26219
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-02T14:53:04Z
fix test
commit 9815ce8e17e34422f8c915d115061a9635abd119
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-03T14:51:55Z
fix pyspark test
commit f2b153800ebdf10999d4a8bb3578101a12f6d631
Author: Zhenhua Wang <[email protected]>
Date: 2017-10-05T15:47:27Z
follow the paper and fix sparkR test
----