HyukjinKwon commented on a change in pull request #28311:
URL: https://github.com/apache/spark/pull/28311#discussion_r414243277



##########
File path: docs/sql-migration-guide.md
##########
@@ -81,7 +81,7 @@ license: |
 
  - In Spark version 2.4 and below, you can create a map with duplicated keys 
via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a 
map with duplicated keys is undefined; for example, map lookup respects the 
duplicated key that appears first, `Dataset.collect` only keeps the duplicated 
key that appears last, `MapKeys` returns duplicated keys, etc. In Spark 3.0, 
Spark throws `RuntimeException` when duplicated keys are found. You can set 
`spark.sql.mapKeyDedupPolicy` to `LAST_WIN` to deduplicate map keys with the 
last-wins policy. Users may still read map values with duplicated keys from 
data sources which do not enforce it (for example, Parquet); the behavior is 
undefined.
 
-  - In Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` 
is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true 
to keep using it. In Spark version 2.4 and below, if 
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure 
with a primitive-type argument, the returned UDF returns null if the input 
value is null. However, in Spark 3.0, the UDF returns the default value of the 
Java type if the input value is null. For example, for `val f = udf((x: Int) => 
x, IntegerType)`, `f($"x")` returns null in Spark 2.4 and below if column `x` 
is null, and returns 0 in Spark 3.0. This behavior change is introduced because 
Spark 3.0 is built with Scala 2.12 by default.
+  - In Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` 
is not allowed by default. Remove the return type parameter to automatically 
switch to typed Scala udf is recommended. Or, set 
`spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. In Spark 
version 2.4 and below, if `org.apache.spark.sql.functions.udf(AnyRef, 
DataType)` gets a Scala closure with a primitive-type argument, the returned 
UDF returns null if the input value is null. However, in Spark 3.0, the UDF 
returns the default value of the Java type if the input value is null. For 
example, for `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns null 
in Spark 2.4 and below if column `x` is null, and returns 0 in Spark 3.0. This 
behavior change is introduced because Spark 3.0 is built with Scala 2.12 by 
default.
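
For reference, a minimal sketch of the duplicate-key behavior covered in the 
first bullet of the hunk above, assuming Spark 3.0 and an active `SparkSession` 
named `spark` (the key and value literals are illustrative):

```
import org.apache.spark.sql.functions.{lit, map}

// Build a map column with the key "a" duplicated.
val df = spark.range(1)
  .select(map(lit("a"), lit(1), lit("a"), lit(2)).as("m"))

// Spark 3.0 default policy (EXCEPTION): materializing the map fails.
// df.collect()  // throws RuntimeException on the duplicate key "a"

// Opt in to the last-wins policy instead:
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
df.collect()  // keeps the value that appears last: Map(a -> 2)
```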
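
Likewise, a minimal sketch of the UDF null-handling change this hunk documents, 
under the same assumptions (active `SparkSession` named `spark`; the identity 
closure is illustrative):

```
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.types.IntegerType

// A single-row DataFrame whose integer column "x" is null.
val df = spark.range(1).select(lit(null).cast(IntegerType).as("x"))

// Untyped variant, disallowed by default in 3.0. If re-enabled via
// spark.sql.legacy.allowUntypedScalaUDF, a null input yields 0 in 3.0
// (the Java default for int) instead of null as in 2.4 and below:
// val f = udf((x: Int) => x, IntegerType)

// Recommended: drop the return type parameter. The typed udf knows the
// argument type, so a null input produces null without calling the closure:
val g = udf((x: Int) => x)
df.select(g(col("x"))).show()  // prints null
```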

Review comment:
       nit
   ```
   Remove the return type parameter to automatically switch to typed Scala udf 
is recommended. Or, set `spark.sql.legacy.allowUntypedScalaUDF` to true to keep 
using it.
   ```
   =>
   
   ```
   Removing the return type parameter to automatically switch to typed Scala 
udf is recommended, or set `spark.sql.legacy.allowUntypedScalaUDF` to true to 
keep using it.
   ```



