Ngone51 commented on a change in pull request #28311:
URL: https://github.com/apache/spark/pull/28311#discussion_r413618292
##########
File path: docs/sql-migration-guide.md
##########
@@ -81,7 +81,7 @@ license: |
- In Spark version 2.4 and below, you can create a map with duplicated keys
via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of
a map with duplicated keys is undefined: for example, a map lookup respects
the duplicated key that appears first, `Dataset.collect` keeps only the
duplicated key that appears last, `MapKeys` returns duplicated keys, etc. In
Spark 3.0, Spark throws `RuntimeException` when duplicated keys are found.
You can set `spark.sql.mapKeyDedupPolicy` to `LAST_WIN` to deduplicate map
keys with the last-wins policy. Users may still read map values with
duplicated keys from data sources which do not enforce it (for example,
Parquet); in that case, the behavior is undefined.
- - In Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)`
is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true
to keep using it. In Spark version 2.4 and below, if
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure
with a primitive-type argument, the returned UDF returns null if the input
value is null. However, in Spark 3.0, the UDF returns the default value of
the Java type if the input value is null. For example, given `val f =
udf((x: Int) => x, IntegerType)`, `f($"x")` returns null in Spark 2.4 and
below if column `x` is null, and returns 0 in Spark 3.0. This behavior change
is introduced because Spark 3.0 is built with Scala 2.12 by default.
+ - In Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)`
is not allowed by default. Removing the return type parameter to
automatically switch to a typed Scala udf is recommended. Alternatively, set
`spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. In Spark
version 2.4 and below, if
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure
with a primitive-type argument, the returned UDF returns null if the input
value is null. However, in Spark 3.0, the UDF returns the default value of
the Java type if the input value is null. For example, given `val f =
udf((x: Int) => x, IntegerType)`, `f($"x")` returns null in Spark 2.4 and
below if column `x` is null, and returns 0 in Spark 3.0. This behavior change
is introduced because Spark 3.0 is built with Scala 2.12 by default.
Review comment:
Diff: added `Removing the return type parameter to automatically switch to a
typed Scala udf is recommended.`
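
To illustrate the behavior change for readers of this thread, here is a
minimal sketch, not part of the diff itself. It assumes a local
`SparkSession`; the app name, the one-row DataFrame, and the column name `x`
are made up to mirror the doc's example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("udf-migration-demo") // hypothetical app name for the sketch
  .getOrCreate()

// One row whose int column `x` is null.
val df = spark.range(1).selectExpr("CAST(NULL AS INT) AS x")

// Untyped udf (explicit DataType). Disallowed by default in Spark 3.0;
// uncomment the legacy flag to keep using it.
// spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
val untyped = udf((x: Int) => x, IntegerType)
df.select(untyped(col("x"))).show()
// Spark 2.4 and below: null. Spark 3.0: 0, the Java default for int.

// Typed udf: drop the DataType argument. Spark derives the return type
// from the closure, and a null input stays null.
val typed = udf((x: Int) => x)
df.select(typed(col("x"))).show()

// Related context from the same guide section: duplicated map keys throw
// a RuntimeException in 3.0 unless a dedup policy is set.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b')").show() // keeps 1 -> 'b'
```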