zero323 commented on a change in pull request #27278:
[SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions.
URL: https://github.com/apache/spark/pull/27278#discussion_r368969058
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/functions.scala
##########
@@ -652,6 +652,122 @@ object functions {
*/
def min(columnName: String): Column = min(Column(columnName))
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(e: Column, percentage: Array[Double], accuracy: Long): Column = {
+ withAggregateFunction {
+ new ApproximatePercentile(
+ e.expr, typedLit(percentage).expr, lit(accuracy).expr
+ )
+ }
+ }
+
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(columnName: String, percentage: Array[Double], accuracy: Long): Column = {
+ percentile_approx(Column(columnName), percentage, accuracy)
+ }
+
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(e: Column, percentage: Seq[Double], accuracy: Long): Column = {
+ percentile_approx(e, percentage.toArray, accuracy)
+ }
+
+ /**
+ * Aggregate function: Returns an array of the approximate percentile values
+ * of numeric column col at the given percentages.
+ *
+ * Each value of the percentage array must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(columnName: String, percentage: Seq[Double], accuracy: Long): Column = {
+ percentile_approx(Column(columnName), percentage.toArray, accuracy)
+ }
+
+ /**
+ * Aggregate function: Returns the approximate percentile value of numeric
+ * column col at the given percentage.
+ *
+ * The value of percentage must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(e: Column, percentage: Double, accuracy: Long): Column = {
+ withAggregateFunction {
+ new ApproximatePercentile(
+ e.expr, lit(percentage).expr, lit(accuracy).expr
+ )
+ }
+ }
+
+ /**
+ * Aggregate function: Returns the approximate percentile value of numeric
+ * column col at the given percentage.
+ *
+ * The value of percentage must be between 0.0 and 1.0.
+ *
+ * The accuracy parameter is a positive numeric literal
+ * which controls approximation accuracy at the cost of memory.
+ * Higher value of accuracy yields better accuracy, 1.0/accuracy
+ * is the relative error of the approximation.
+ *
+ * @group agg_funcs
+ * @since 3.0.0
+ */
+ def percentile_approx(columnName: String, percentage: Double, accuracy: Long): Column = {
Review comment:
To be honest I am not very enthusiastic about it and I am not even convinced
that it is consistent with the rest of `functions`.
The closest equivalents we have are
- `approx_count_distinct` with `rsd`
- `last` with `ignoreNulls`
and both use external types, not columns. Not to mention that this is still
counter-intuitive and painful to use. That said,
> we don't need to duplicate docs with less maintenance.
is a fair point.
- I can easily remove the `Seq` variants, that's for sure, and cut the number
of signatures by two, leaving us with four.
- If having only the `Column` variant on the JVM is fine, we can drop the
`(String, _, _) => Column` variant, which brings us down to two variants.
- It is also not hard to build `Column` objects transparently for Python and
R users to support `(Column, Column, Column) => Column`. But I am still
concerned about confusing semantics.
If two variants are still too much, we could always have `(Column, Any,
Double) => Column`; `o.a.sql.functions` is already quite full of `Any`s.
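For illustration only, here is a minimal sketch of how a single `Any`-typed entry point could dispatch on the `percentage` argument. It does not use Spark: `Expr`, `ColRef`, `DoubleLit`, `ArrayLit`, `LongLit`, `ApproxPercentile` and `PercentileSketch.percentileApprox` are hypothetical stand-ins invented for this example, and `accuracy` is typed as `Long` to match the signatures in the diff above.

```scala
// Hypothetical stand-ins for Column expressions; NOT the real Spark API.
sealed trait Expr
final case class ColRef(name: String) extends Expr
final case class DoubleLit(value: Double) extends Expr
final case class ArrayLit(values: Seq[Double]) extends Expr
final case class LongLit(value: Long) extends Expr
final case class ApproxPercentile(child: Expr, percentage: Expr, accuracy: Expr) extends Expr

object PercentileSketch {
  // Enforces the documented range: each percentage must be in [0.0, 1.0].
  private def checked(p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, s"percentage must be between 0.0 and 1.0, got $p")
    p
  }

  // One signature instead of several overloads: `percentage` is resolved at runtime.
  def percentileApprox(e: Expr, percentage: Any, accuracy: Long): Expr = {
    require(accuracy > 0, s"accuracy must be positive, got $accuracy")
    val p: Expr = percentage match {
      case x: Expr          => x                              // already an expression
      case d: Double        => DoubleLit(checked(d))          // single percentile
      case a: Array[Double] => ArrayLit(a.toSeq.map(checked)) // several percentiles
      case s: Seq[_] =>
        ArrayLit(s.map {
          case d: Double => checked(d)
          case other =>
            throw new IllegalArgumentException(s"unsupported element: $other")
        })
      case other =>
        throw new IllegalArgumentException(s"unsupported percentage: $other")
    }
    ApproxPercentile(e, p, LongLit(accuracy))
  }
}
```

With this shape, `PercentileSketch.percentileApprox(ColRef("x"), "0.5", 100L)` still compiles and only fails at runtime, which is the kind of trade-off an `Any` parameter brings compared with typed overloads.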