[
https://issues.apache.org/jira/browse/SPARK-15251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281638#comment-15281638
]
Viacheslav Saevskiy commented on SPARK-15251:
---------------------------------------------
The only workaround I found is to
{code}
sqlContext.sql("SELECT Sum(x) t FROM
my_data").cache().registerTempTable("my_data2")
{code}
and then
{code}
sqlContext.sql("SELECT timesTwo(t) FROM my_data2").show()
{code}
*cache* will evaluate RDD so that two function can work.
It's seems that python UDF doesn't work with aggregated function (as well as
build-in)
neither
{code}
SELECT timesTwo(Sum(x)) t FROM my_data
{code}
nor
{code}
SELECT Sum(timesTwo(x)) t FROM my_data
{code}
[~davies] Can you suggest how to fix it?
> Cannot apply PythonUDF to aggregated column
> -------------------------------------------
>
> Key: SPARK-15251
> URL: https://issues.apache.org/jira/browse/SPARK-15251
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.1
> Reporter: Matthew Livesey
>
> In scala it is possible to define a UDF an apply it to an aggregated value in
> an expression, for example:
> {code}
> def timesTwo(x: Int): Int = x * 2
> sqlContext.udf.register("timesTwo", timesTwo _)
> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
> case class Data(x: Int, y: String)
> val data = List(Data(1, "a"), Data(2, "b"))
> val rdd = sc.parallelize(data)
> val df = rdd.toDF
> df.registerTempTable("my_data")
> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
> +---+
> | t|
> +---+
> | 6|
> +---+
> {code}
> Performing the same computation in pyspark:
> {code}
> def timesTwo(x):
> return x * 2
> sqlContext.udf.register("timesTwo", timesTwo)
> data = [(1, 'a'), (2, 'b')]
> rdd = sc.parallelize(data)
> df = sqlContext.createDataFrame(rdd, ["x", "y"])
> df.registerTempTable("my_data")
> sqlContext.sql("SELECT timesTwo(Sum(x)) t FROM my_data").show()
> {code}
> Gives the following:
> {code}
> AnalysisException: u"expression 'pythonUDF' is neither present in the group
> by, nor is it an aggregate function. Add to group by or wrap in first() (or
> first_value) if you don't care which value you get.;"
> {code}
> Using a lambda rather than a named function gives the same error.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]