[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378296#comment-15378296 ]
Tim Hunter commented on SPARK-16283:
------------------------------------

Are we trying to reproduce Hive's results exactly here? If so, there is no choice but to port Hive's code.

If we just want an equivalent result, then we can use the following pseudo-Python code:

{code}
def percentile_approx(df, x, num_hist):
    # Use float division: under Python 2, 1/num_hist would truncate to 0.
    return quantile_approx(df, x, max(1.0 / num_hist, 1e-3))
{code}

The advantage over Hive is that the final result comes with theoretical error bounds. The only issue is that the runtime in this case is O(num_hist ^ 2) (instead of linear), if I remember correctly.

Also, if we want to spend more time on improving the algorithms, I would prefer something that has some known guarantees rather than something completely novel.

> Implement percentile_approx SQL function
> ----------------------------------------
>
>                 Key: SPARK-16283
>                 URL: https://issues.apache.org/jira/browse/SPARK-16283
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>
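
To make the mapping in the pseudocode above concrete, here is a minimal, self-contained Python sketch. It is not Spark code: `relative_error_for` is a hypothetical helper name, the extra percentile argument `p` is an assumption (the pseudocode leaves it out), and `quantile_approx` is an exact, sort-based stand-in that trivially satisfies any error bound rather than the sketch-based algorithm the comment has in mind.

```python
import math


def relative_error_for(num_hist):
    """Map a histogram-bin count to a target relative rank error.

    Mirrors max(1/num_hist, 1e-3) from the pseudocode, using float division.
    """
    if num_hist <= 0:
        raise ValueError("num_hist must be positive")
    return max(1.0 / num_hist, 1e-3)


def quantile_approx(values, p, eps):
    """Stand-in for an approximate quantile routine (assumption).

    This version is exact: it sorts the data and returns the element at
    rank ceil(p * n), so its rank error is 0 <= eps * n for any eps.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = math.ceil(p * len(ordered))
    index = min(len(ordered) - 1, max(0, rank - 1))
    return ordered[index]


def percentile_approx(values, p, num_hist):
    """Answer a percentile query with an error derived from num_hist."""
    return quantile_approx(values, p, relative_error_for(num_hist))
```

With this mapping, a larger `num_hist` asks for a tighter error, but never tighter than 1e-3, which caps the cost of the underlying summary.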