[GitHub] spark pull request #16553: [SPARK-9435][SQL] Reuse function in Java UDF to c...

HyukjinKwon Wed, 11 Jan 2017 10:21:58 -0800

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/16553


    [SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison

    ## What changes were proposed in this pull request?
    
    Currently, running the following Java code
    
    ```java
    spark.udf().register("inc", new UDF1<Long, Long>() {
      @Override
      public Long call(Long i) {
        return i + 1;
      }
    }, DataTypes.LongType);
    
    spark.range(10).toDF("x").createOrReplaceTempView("tmp");
    Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head();
    Assert.assertEquals(7, result.getLong(0));
    ```
    
    fails as below:
    
    ```
    org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
    Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L]
    +- SubqueryAlias tmp, `tmp`
       +- Project [id#16L AS x#19L]
          +- Range (0, 10, step=1, splits=Some(8))
    
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
    ```
    
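    The root cause is that we were creating a new function object every time
    the UDF expression was built. Function objects compare by identity, so two
    functions wrapping the same logic are not equal, as shown below: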
    
    ```scala
    scala> def inc(i: Int) = i + 1
    inc: (i: Int)Int
    
    scala> (inc(_: Int)).hashCode
    res15: Int = 1231799381
    
    scala> (inc(_: Int)).hashCode
    res16: Int = 2109839984
    
    scala> (inc(_: Int)) == (inc(_: Int))
    res17: Boolean = false
    ```
    
    This seems to lead to comparison failures between `ScalaUDF` expressions
    created from the Java UDF API, for example in `Expression.semanticEquals`.
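    
    To make the mechanism concrete, here is a minimal, Spark-free Java sketch.
    `SimpleUdf`, `wrap`, and `UdfExpr` are hypothetical stand-ins invented for
    illustration (they are not Spark classes, and this is not the patch
    itself). The sketch only assumes that an expression compares the function
    objects it holds, and shows why wrapping the UDF once and reusing the
    wrapper makes two expressions equal:
    
    ```java
    import java.util.function.Function;
    
    class UdfReuseSketch {
      // Hypothetical stand-in for a Java UDF interface such as UDF1.
      interface SimpleUdf<T, R> {
        R call(T value) throws Exception;
      }
    
      // Builds a brand-new wrapper object on every call. The anonymous class
      // does not override equals(), so wrappers compare by identity.
      static <T, R> Function<T, R> wrap(SimpleUdf<T, R> udf) {
        return new Function<T, R>() {
          @Override
          public R apply(T value) {
            try {
              return udf.call(value);
            } catch (Exception e) {
              throw new RuntimeException(e);
            }
          }
        };
      }
    
      // Hypothetical expression node whose equality depends on its function.
      static final class UdfExpr {
        final Function<Long, Long> function;
    
        UdfExpr(Function<Long, Long> function) {
          this.function = function;
        }
    
        boolean semanticEquals(UdfExpr other) {
          return function.equals(other.function);
        }
      }
    
      // Wraps the same UDF twice: the two expressions hold different objects.
      static boolean equalWhenWrappedPerUse() {
        SimpleUdf<Long, Long> inc = i -> i + 1;
        return new UdfExpr(wrap(inc)).semanticEquals(new UdfExpr(wrap(inc)));
      }
    
      // Wraps the UDF once and reuses the wrapper in both expressions.
      static boolean equalWhenWrapperReused() {
        SimpleUdf<Long, Long> inc = i -> i + 1;
        Function<Long, Long> shared = wrap(inc);
        return new UdfExpr(shared).semanticEquals(new UdfExpr(shared));
      }
    
      public static void main(String[] args) {
        System.out.println(equalWhenWrappedPerUse());  // prints "false"
        System.out.println(equalWhenWrapperReused());  // prints "true"
      }
    }
    ```
    
    In this sketch the UDF logic is identical in both cases; only the identity
    of the wrapper object differs, which is enough to break an equality check
    that is based on the function object.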
    
    The Scala API already seems fine.
    
    Both cases can be tested easily as below, if any reviewer is comfortable
    with Scala:
    
    ```scala
    val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y")
    val javaUDF = new UDF1[Int, Int]  {
      override def call(i: Int): Int = i + 1
    }
    // spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API
    // spark.udf.register("inc", (i: Int) => i + 1)    // Uncomment this for Scala API
    df.createOrReplaceTempView("tmp")
    spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show()
    ```
    
    ## How was this patch tested?
    
    Unit test in `JavaUDFSuite.java`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-9435

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16553.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16553
    
----
commit 30ed14f38b5c38091d07d0e014a49e494aeb73cc
Author: hyukjinkwon <[email protected]>
Date:   2017-01-11T18:02:08Z

    Reuse function in Java UDF to support correctly expression equality 
comparison

----


