maropu commented on a change in pull request #31103:
URL: https://github.com/apache/spark/pull/31103#discussion_r554406326



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
##########
@@ -1982,8 +1982,34 @@ class DatasetSuite extends QueryTest
       assert(timezone == "Asia/Shanghai")
     }
   }
+
+  test("SPARK-34002: Fix broken Option input/output in UDF") {
+    def f1(bar: Bar): Option[Bar] = {
+      None
+    }
+
+    def f2(bar: Option[Bar]): Option[Bar] = {
+      bar
+    }
+
+    val udf1: UserDefinedFunction = udf(f1 _).withName("f1")
+    val udf2: UserDefinedFunction = udf(f2 _).withName("f2")

Review comment:
       nit: `val udf1 = udf(f1 _).withName("f1")`
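
       For reference, a minimal sketch of this nit applied to both declarations (the explicit `UserDefinedFunction` ascription is redundant since `udf` already returns that type):

       ```scala
       val udf1 = udf(f1 _).withName("f1")
       val udf2 = udf(f2 _).withName("f2")
       ```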

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
##########
@@ -1982,8 +1982,34 @@ class DatasetSuite extends QueryTest
       assert(timezone == "Asia/Shanghai")
     }
   }
+
+  test("SPARK-34002: Fix broken Option input/output in UDF") {
+    def f1(bar: Bar): Option[Bar] = {
+      None
+    }
+
+    def f2(bar: Option[Bar]): Option[Bar] = {
+      bar
+    }
+
+    val udf1: UserDefinedFunction = udf(f1 _).withName("f1")
+    val udf2: UserDefinedFunction = udf(f2 _).withName("f2")
+
+    val df = (1 to 2).map(i => Tuple1(Bar(1))).toDF("c0")
+    val newDf = df
+      .withColumn("c1", udf1(col("c0")))
+      .withColumn("c2", udf2(col("c1")))
+    val schema = newDf.schema
+    assert(schema == StructType(

Review comment:
       `assert(newDf.schema == StructType(`

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
##########
@@ -304,8 +304,13 @@ case class ExpressionEncoder[T](
     StructField(s.name, s.dataType, s.nullable)
   })
 
+  /**
+   * This is used for `ScalaUDF` (see `UDFRegistration`). As the serialization in `ScalaUDF` is
+   * for an individual column, not the whole row, we just take the data type of the vanilla
+   * object serializer, not `serializer`, which is transformed somehow.
+   */
   def dataTypeAndNullable: Schema = {
-    val dataType = if (isSerializedAsStruct) schema else schema.head.dataType
+    val dataType = objSerializer.dataType
     Schema(dataType, objSerializer.nullable)

Review comment:
       Btw, don't we need to move this thin helper func to the `UDFRegistration` side? It seems this func now exists specifically for UDF-related code.
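
       If it were moved, a minimal sketch of what that might look like as a private helper on the `UDFRegistration` side (the placement and parameter name are illustrative assumptions, not the actual change):

       ```scala
       import org.apache.spark.sql.catalyst.ScalaReflection.Schema
       import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

       // Sketch only: derive the column-level data type from the encoder's
       // vanilla object serializer, mirroring ExpressionEncoder.dataTypeAndNullable.
       private def dataTypeAndNullable(enc: ExpressionEncoder[_]): Schema =
         Schema(enc.objSerializer.dataType, enc.objSerializer.nullable)
       ```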

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
##########
@@ -304,8 +304,13 @@ case class ExpressionEncoder[T](
     StructField(s.name, s.dataType, s.nullable)
   })
 
+  /**
+   * This is used for `ScalaUDF` (see `UDFRegistration`). As the serialization in `ScalaUDF` is
+   * for an individual column, not the whole row, we just take the data type of the vanilla
+   * object serializer, not `serializer`, which is transformed somehow.
+   */
   def dataTypeAndNullable: Schema = {
-    val dataType = if (isSerializedAsStruct) schema else schema.head.dataType
+    val dataType = objSerializer.dataType
     Schema(dataType, objSerializer.nullable)

Review comment:
       nit: `Schema(objSerializer.dataType, objSerializer.nullable)`
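
       For reference, the nit applied to the whole method, dropping the intermediate val:

       ```scala
       def dataTypeAndNullable: Schema =
         Schema(objSerializer.dataType, objSerializer.nullable)
       ```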



