GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/16553
[SPARK-9435][SQL] Reuse function in Java UDF to correctly support
expressions that require equality comparison
## What changes were proposed in this pull request?
Currently, running the following Java code
```java
spark.udf().register("inc", new UDF1<Long, Long>() {
  @Override
  public Long call(Long i) {
    return i + 1;
  }
}, DataTypes.LongType);
spark.range(10).toDF("x").createOrReplaceTempView("tmp");
Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head();
Assert.assertEquals(7, result.getLong(0));
```
fails as below:
```
org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither
present in the group by, nor is it an aggregate function. Add to group by or
wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L]
+- SubqueryAlias tmp, `tmp`
   +- Project [id#16L AS x#19L]
      +- Range (0, 10, step=1, splits=Some(8))
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
```
The root cause is that we were creating a new function object every time
the UDF expression was built, and function objects created this way do not
compare equal, as below:
```scala
scala> def inc(i: Int) = i + 1
inc: (i: Int)Int
scala> (inc(_: Int)).hashCode
res15: Int = 1231799381
scala> (inc(_: Int)).hashCode
res16: Int = 2109839984
scala> (inc(_: Int)) == (inc(_: Int))
res17: Boolean = false
```
This seems to cause comparisons between `ScalaUDF` instances created through
the Java UDF API to fail, for example in `Expression.semanticEquals`.
The Scala API already seems fine.
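For reviewers who prefer Java, the same effect can be sketched without Spark. This is only an illustration under stated assumptions, not the code in this patch: `UdfWrapperEquality`, `Udf1`, and `wrap` are hypothetical names, and `Udf1` is a simplified stand-in for the Java UDF interfaces. It shows that a wrapper rebuilt on each use never equals another wrapper around the same UDF, while a wrapper built once and reused does.

```java
import java.util.function.Function;

// Hypothetical demo class; none of these names come from Spark.
class UdfWrapperEquality {
    // Simplified stand-in for a Java UDF interface such as UDF1.
    interface Udf1<T, R> {
        R call(T value);
    }

    // Builds a new wrapper object on every call. The anonymous class does not
    // override equals(), so wrappers are compared by reference only.
    static <T, R> Function<T, R> wrap(final Udf1<T, R> udf) {
        return new Function<T, R>() {
            @Override
            public R apply(T value) {
                return udf.call(value);
            }
        };
    }

    public static void main(String[] args) {
        Udf1<Long, Long> inc = new Udf1<Long, Long>() {
            @Override
            public Long call(Long i) {
                return i + 1;
            }
        };

        // Wrapping twice yields two distinct objects that are not equal.
        boolean rebuiltEqual = wrap(inc).equals(wrap(inc));

        // Wrapping once and reusing the result keeps equality intact.
        Function<Long, Long> reused = wrap(inc);
        boolean reusedEqual = reused.equals(reused);

        System.out.println("rebuilt each time: " + rebuiltEqual);   // false
        System.out.println("built once, reused: " + reusedEqual);   // true
    }
}
```

Any comparison that relies on the wrapped function being equal, such as matching a `GROUP BY` expression against a `SELECT` expression, would behave like the first case if the wrapper is rebuilt per expression.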
Both cases can be tested easily with the snippet below, if any reviewer is
comfortable with Scala:
```scala
val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y")
val javaUDF = new UDF1[Int, Int] {
  override def call(i: Int): Int = i + 1
}
// spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API
// spark.udf.register("inc", (i: Int) => i + 1) // Uncomment this for Scala API
df.createOrReplaceTempView("tmp")
spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show()
```
## How was this patch tested?
Unit test in `JavaUDFSuite.java`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-9435
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16553.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16553
----
commit 30ed14f38b5c38091d07d0e014a49e494aeb73cc
Author: hyukjinkwon <[email protected]>
Date: 2017-01-11T18:02:08Z
Reuse function in Java UDF to support correctly expression equality
comparison
----