[
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374295#comment-15374295
]
Liwei Lin edited comment on SPARK-16283 at 7/13/16 6:05 AM:
------------------------------------------------------------
Hive's percentile_approx implementation computes approximate percentile values
from a histogram (please refer to
[Hive/GenericUDAFPercentileApprox.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java]
and
[Hive/NumericHistogram.java|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java]
for details):
- Hive's percentile_approx's signature is: {{\_FUNC\_(expr, pc, \[nb\])}}
- parameter \[nb\] -- the number of histogram bins to use -- is optionally
specified by users
- if the number of unique values in the actual dataset is less than or equal
to this \[nb\], we can expect an exact result; otherwise there are no
approximation guarantees
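To illustrate the point about \[nb\] above, here is a minimal Python sketch of a fixed-size histogram. It is not Hive's NumericHistogram code: the names TinyHistogram, build and exact_quantile, the merge-the-two-closest-bins rule, and the non-interpolating quantile read are all simplifying assumptions made for this example.

```python
import bisect
import math


class TinyHistogram:
    """Toy fixed-size histogram (hypothetical, not Hive's class)."""

    def __init__(self, nb):
        self.nb = nb      # maximum number of bins
        self.bins = []    # sorted list of [center, count]

    def add(self, x):
        centers = [c for c, _ in self.bins]
        i = bisect.bisect_left(centers, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1          # value already has its own bin
            return
        self.bins.insert(i, [x, 1])
        if len(self.bins) > self.nb:
            self._merge_closest()         # over budget: lose information here

    def _merge_closest(self):
        # Merge the two adjacent bins whose centers are nearest to each other
        # into one bin at their count-weighted average.
        j = min(range(len(self.bins) - 1),
                key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        self.bins[j:j + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]]

    def quantile(self, q):
        # Assumption: return the center of the first bin whose cumulative
        # count reaches the target rank (no interpolation between bins).
        total = sum(n for _, n in self.bins)
        target = max(1, math.ceil(q * total))
        running = 0
        for center, n in self.bins:
            running += n
            if running >= target:
                return center
        raise ValueError("empty histogram")


def build(data, nb):
    h = TinyHistogram(nb)
    for x in data:
        h.add(x)
    return h


def exact_quantile(data, q):
    # Nearest-rank quantile on the fully sorted data, for comparison.
    s = sorted(data)
    return s[max(1, math.ceil(q * len(s))) - 1]


if __name__ == "__main__":
    data = [1, 1, 2, 3, 3, 3, 5, 8]       # 5 distinct values
    for nb in (5, 2):
        h = build(data, nb)
        print(nb, h.bins, h.quantile(0.5), exact_quantile(data, 0.5))
```

In this sketch, as long as nb is at least the number of distinct values, each bin holds exactly one value with its true count, so every quantile read matches the exact one. Once bins have to be merged, the centers become weighted averages and nothing in the structure bounds how far the answer can drift.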
Our Dataset's approxQuantile() implementation is not really histogram-based
(and thus differs from Hive's implementation):
- our Dataset's approxQuantile()'s signature is something like:
{{\_FUNC\_(expr, pc, relativeError)}}
- parameter relativeError is specified by users and should be in \[0, 1\]; our
approximation is deterministically bounded by this relativeError -- please
refer to
[Spark/DataFrameStatFunctions.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L39]
for details
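As a rough illustration of what "bounded by relativeError" means, the Python sketch below checks a returned value against the rank bound described in the Scaladoc linked above, as I read it: for N elements, probability p and error err, the result x should satisfy floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The function names rank_range and satisfies_bound are hypothetical, and this only checks the guarantee; it does not reproduce Spark's algorithm.

```python
import bisect
import math


def rank_range(sorted_data, x):
    """Return the 1-based (first, last) ranks that x occupies in sorted_data."""
    lo = bisect.bisect_left(sorted_data, x)
    hi = bisect.bisect_right(sorted_data, x)
    if lo == hi:
        raise ValueError("x is not an element of the data")
    return lo + 1, hi


def satisfies_bound(sorted_data, x, p, relative_error):
    """True if some rank of x lies within the allowed window around p * N."""
    n = len(sorted_data)
    lowest_ok = math.floor((p - relative_error) * n)
    highest_ok = math.ceil((p + relative_error) * n)
    first, last = rank_range(sorted_data, x)
    return first <= highest_ok and last >= lowest_ok


if __name__ == "__main__":
    data = list(range(1, 101))            # ranks equal values here
    for candidate in (50, 47, 40):
        print(candidate, satisfies_bound(data, candidate, 0.5, 0.05))
```

The window is expressed in ranks, not in values or bins, which is why it does not translate into a bin count.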
Since there's no direct deterministic relationship between \[nb\] and
relativeError, it seems hard to build Hive's percentile_approx on top of our
Dataset's approxQuantile(). So should we: (a) port Hive's implementation into
Spark and provide {{\_FUNC\_(expr, pc, \[nb\])}} on top of it, or (b) provide
{{\_FUNC\_(expr, pc, relativeError)}} directly on top of our Dataset's
approxQuantile() implementation, even though this might be incompatible with
Hive?
[~rxin], [~thunterdb] could you share some thoughts? Thanks!
> Implement percentile_approx SQL function
> ----------------------------------------
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
>