[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

Liwei Lin (JIRA) Tue, 12 Jul 2016 23:05:31 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374295#comment-15374295
 ]


Liwei Lin edited comment on SPARK-16283 at 7/13/16 6:05 AM:
------------------------------------------------------------

Hive's percentile_approx implementation computes approximate percentile values 
from a histogram (please refer to 
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
 and 
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
 for details):
- the signature of Hive's percentile_approx is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally 
specified by users
- if the number of unique values in the actual dataset is less than or equal 
to this \[nb\], we can expect an exact result; otherwise there are no 
approximation guarantees
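
To illustrate the last point, here is a toy fixed-size histogram in Python. It is only a sketch of the general idea (merge the two closest bins once more than nb bins exist); it is not a port of Hive's NumericHistogram.java. The class name {{FixedBinHistogram}} and the nearest-rank lookup in {{percentile()}} are assumptions made for this example.

```python
import bisect


class FixedBinHistogram:
    """Toy streaming histogram that keeps at most `nb` (center, count) bins.

    Hypothetical illustration only; not Hive's actual implementation.
    """

    def __init__(self, nb):
        self.nb = nb
        self.bins = []  # list of [center, count], kept sorted by center

    def add(self, value):
        centers = [center for center, _ in self.bins]
        i = bisect.bisect_left(centers, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            # Value already has its own bin: just bump the count.
            self.bins[i][1] += 1
            return
        self.bins.insert(i, [float(value), 1])
        if len(self.bins) > self.nb:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair of bins with the smallest gap between centers.
        j = min(range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        # Replace the pair with one bin at the count-weighted mean.
        self.bins[j:j + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    def percentile(self, pc):
        # Simplified lookup (an assumption of this sketch): return the center
        # of the first bin whose cumulative count reaches pc * total.
        total = sum(count for _, count in self.bins)
        running = 0
        for center, count in self.bins:
            running += count
            if running >= pc * total:
                return center
        return self.bins[-1][0]
```

Feeding the values 1, 2, 2, 3, 3, 3, 4, 4, 4, 4 into {{FixedBinHistogram(nb=4)}} never triggers a merge, so every bin is an exact (value, count) pair and {{percentile(0.5)}} returns the true median 3.0. With {{nb=2}} the bins are merged into weighted means, and the returned value is no longer guaranteed to be an element of the dataset.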


Our Dataset's approxQuantile() implementation is not really histogram-based 
(and thus differs from Hive's implementation):
- the signature of our Dataset's approxQuantile() is something like: 
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our 
approximation is deterministically bounded by this relativeError -- please refer 
to 
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
 for details
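
For reference, the bound documented for approxQuantile() is a rank bound: with N elements, probability p and error err, the returned sample x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The Python sketch below only checks that inequality for a candidate value; it does not implement the quantile algorithm itself, and the helper name {{within_rank_bound}} is made up for this example.

```python
import bisect
import math


def within_rank_bound(sorted_data, x, p, relative_error):
    """Check whether x is an acceptable answer for quantile p.

    Hypothetical helper: tests the documented rank inequality
    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
    `sorted_data` must already be sorted in ascending order.
    """
    n = len(sorted_data)
    lowest_rank = math.floor((p - relative_error) * n)
    highest_rank = math.ceil((p + relative_error) * n)

    # With duplicates, x occupies a range of 1-based ranks [first, last].
    first = bisect.bisect_left(sorted_data, x) + 1
    last = bisect.bisect_right(sorted_data, x)
    if last < first:
        return False  # x is not an element of the dataset

    # Accept x if any of its ranks falls inside the allowed window.
    return first <= highest_rank and last >= lowest_rank
```

For the values 1 to 100 with p = 0.5 and relativeError = 0.05, a result of 52 passes the check while 70 does not; with relativeError = 0 only the exact median is accepted. Nothing in this inequality refers to a bin count, which is why \[nb\] cannot simply be translated into a relativeError.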


Since there's no direct deterministic relationship between \[nb\] and 
relativeError, it seems hard to build Hive's percentile_approx on top of our 
Dataset's approxQuantile(). So should we: (a) port Hive's implementation into 
Spark and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or (b) provide 
{{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's 
approxQuantile() implementation, even though this might be incompatible with 
Hive? [~rxin], [~thunterdb], could you share some thoughts? Thanks!


> Implement percentile_approx SQL function
> ----------------------------------------
>
>                 Key: SPARK-16283
>                 URL: https://issues.apache.org/jira/browse/SPARK-16283
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>




