[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15378296#comment-15378296
 ] 

Tim Hunter commented on SPARK-16283:
------------------------------------

Are we trying to reproduce Hive's results here? If so, there is no choice but 
to port Hive's code. If we just want an equivalent result, we can use the 
following Python-like pseudocode:

{code}
def percentile_approx(df, x, num_hist):
  # 1.0 forces float division; 1/num_hist truncates to 0 under Python 2.
  return quantile_approx(df, x, max(1.0 / num_hist, 1e-3))
{code}

Unlike Hive's implementation, this approach gives theoretical bounds on the 
result. The only issue is that the runtime in this case is O(num_hist^2) 
(instead of linear), if I remember correctly.
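
To make the "theoretical bounds" concrete, here is a minimal pure-Python sketch. It assumes the guarantee is the rank bound that Spark documents for DataFrame.approxQuantile: with N items, probability p and relative error eps, the returned value x has floor((p - eps) * N) <= rank(x) <= ceil((p + eps) * N). The helper names relative_error, rank_bounds and satisfies_bound are hypothetical and exist only for this illustration.

```python
import math


def relative_error(num_hist):
    # Mapping from the pseudocode above: more histogram bins request a
    # smaller relative error, floored at 1e-3.
    return max(1.0 / num_hist, 1e-3)


def rank_bounds(p, eps, n):
    # Allowed rank range for an eps-approximate p-quantile over n items,
    # clamped to [0, n].
    lo = max(int(math.floor((p - eps) * n)), 0)
    hi = min(int(math.ceil((p + eps) * n)), n)
    return lo, hi


def satisfies_bound(data, x, p, eps):
    # rank(x) is taken here as the number of items in data that are <= x.
    rank = sum(1 for v in data if v <= x)
    lo, hi = rank_bounds(p, eps, len(data))
    return lo <= rank <= hi


data = list(range(1, 1001))   # 1000 distinct values
eps = relative_error(100)     # num_hist = 100
print(satisfies_bound(data, 495, 0.5, eps))  # close to the true median
print(satisfies_bound(data, 400, 0.5, eps))  # too far from the median
```

With num_hist = 100 any value whose rank is within about 1% of N from the target rank is an acceptable answer, which is the kind of guarantee the Hive histogram approach does not state.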

Also, if we want to spend more time on improving the algorithms, I would prefer 
something that has some known guarantees rather than something completely novel.

> Implement percentile_approx SQL function
> ----------------------------------------
>
>                 Key: SPARK-16283
>                 URL: https://issues.apache.org/jira/browse/SPARK-16283
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>




