rwpenney opened a new pull request #30745: URL: https://github.com/apache/spark/pull/30745
### Why is this change being proposed?

This patch adds support for a new "product" aggregation function in `sql.functions`, which multiplies together all values in an aggregation group.

This is likely to be useful in statistical applications that combine probabilities, or in financial applications that combine cumulative interest, but it is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other Spark users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is much more concise than an expression of the form `exp(sum(log(...)))`, and it avoids the awkward edge cases that arise when some values are zero or negative.

### Does this PR introduce _any_ user-facing change?

No. It only adds a new function.

### How was this patch tested?

Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and for its invocation via the (Scala) `sql.functions.product` function. The latter, and the PySpark wrapper, have also been manually tested in spark-shell and pyspark sessions.

An illustration of the new functionality within PySpark is as follows:

```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```
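To make the edge cases mentioned above concrete, here is a minimal plain-Python sketch (not Spark code, and not part of this patch) of what an `exp(sum(log(...)))` workaround has to handle by hand when a group contains zeros or negative values. The helper name `product_via_logs` is hypothetical and is used only for illustration.

```python
import math


def product_via_logs(values):
    """Emulate a product via exp(sum(log(|x|))), patching up zeros and signs.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    # log(0) is undefined, so any zero has to be special-cased.
    if any(v == 0 for v in values):
        return 0.0
    # log of a negative number is undefined for reals, so track the sign
    # separately and take logs of the absolute values.
    negatives = sum(1 for v in values if v < 0)
    magnitude = math.exp(sum(math.log(abs(v)) for v in values))
    return -magnitude if negatives % 2 else magnitude


values = [2.0, -3.0, 4.0]

# A direct product needs no special handling.
direct = math.prod(values)

# The log-based emulation agrees only up to floating-point rounding.
via_logs = product_via_logs(values)

# The naive form, without the sign and zero handling, fails outright here.
try:
    math.exp(sum(math.log(v) for v in values))
    naive_failed = False
except ValueError:
    naive_failed = True
```

A direct product avoids both the extra bookkeeping and the rounding introduced by the round trip through logarithms, which is the motivation given for a dedicated aggregate.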
