dongjoon-hyun commented on a change in pull request #28125: [SPARK-31351][DOC]
Migration Guide Auditing for Spark 3.0 Release
URL: https://github.com/apache/spark/pull/28125#discussion_r403765566
##########
File path: docs/sql-migration-guide.md
##########
@@ -24,335 +24,195 @@ license: |
## Upgrading from Spark SQL 3.0 to 3.1
- - Since Spark 3.1, grouping_id() returns long values. In Spark version 3.0
and earlier, this function returns int values. To restore the behavior before
Spark 3.0, you can set `spark.sql.legacy.integerGroupingId` to `true`.
+ - In Spark 3.1, grouping_id() returns long values. In Spark version 3.0 and
earlier, this function returns int values. To restore the behavior before Spark
3.0, you can set `spark.sql.legacy.integerGroupingId` to `true`.
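
A minimal sketch of the difference, assuming an active `SparkSession` named `spark` and a hypothetical `employees` table:

```scala
// grouping_id() is LongType by default in Spark 3.1.
// Uncomment the legacy flag to get the pre-3.1 IntegerType result back.
// spark.conf.set("spark.sql.legacy.integerGroupingId", "true")
spark.sql("""
  SELECT dept, grouping_id() AS gid, count(*) AS cnt
  FROM employees
  GROUP BY dept WITH ROLLUP
""").printSchema()   // gid: long (int when the legacy flag is set)
```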
- - Since Spark 3.1, SQL UI data adopts the `formatted` mode for the query
plan explain results. To restore the behavior before Spark 3.0, you can set
`spark.sql.ui.explainMode` to `extended`.
+ - In Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan
explain results. To restore the behavior before Spark 3.0, you can set
`spark.sql.ui.explainMode` to `extended`.
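
For instance, switching the UI back to the pre-3.1 plan format is a one-line runtime setting (assuming a `SparkSession` named `spark`):

```scala
// SQL UI query plans use the "formatted" explain mode by default in Spark 3.1.
spark.conf.set("spark.sql.ui.explainMode", "extended")   // pre-3.1 style plans
```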
## Upgrading from Spark SQL 2.4 to 3.0
### Dataset/DataFrame APIs
- - Since Spark 3.0, the Dataset and DataFrame API `unionAll` is not
deprecated any more. It is an alias for `union`.
+ - In Spark 3.0, the Dataset and DataFrame API `unionAll` is no longer
deprecated. It is an alias for `union`.
- - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a
grouped dataset with key attribute wrongly named as "value", if the key is
non-struct type, e.g. int, string, array, etc. This is counterintuitive and
makes the schema of aggregation queries weird. For example, the schema of
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the
grouping attribute to "key". The old behaviour is preserved under a newly added
configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a
default value of `false`.
+ - In Spark 2.4 and below, `Dataset.groupByKey` results in a grouped dataset with the key attribute wrongly named "value" if the key is of non-struct type, for example, int, string, array, etc. This is counterintuitive and makes the schema of aggregation queries unexpected. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. In Spark 3.0, the grouping attribute is named "key". The old behavior is preserved under a newly added configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a default value of `false`.
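
A small illustrative snippet of the schema change (assumes a `SparkSession` named `spark` and its implicits):

```scala
import spark.implicits._

val counts = Seq("a", "b", "a").toDS().groupByKey(identity).count()
counts.printSchema()
// Spark 2.4: value (string), count(1) (long)  -- non-struct key labeled "value"
// Spark 3.0: key   (string), count(1) (long)
```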
### DDL Statements
- - Since Spark 3.0, `CREATE TABLE` without a specific provider will use the
value of `spark.sql.sources.default` as its provider. In Spark version 2.4 and
earlier, it was hive. To restore the behavior before Spark 3.0, you can set
`spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
+ - In Spark 3.0, `CREATE TABLE` without a specific provider uses the value of
`spark.sql.sources.default` as its provider. In Spark version 2.4 and below, it
was Hive. To restore the behavior before Spark 3.0, you can set
`spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
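
A rough sketch of the change (table names are made up; restoring the old behavior also assumes a Hive-enabled session):

```scala
// Spark 3.0: no provider given, so spark.sql.sources.default (parquet) is used.
spark.sql("CREATE TABLE t_new (id INT, name STRING)")

// To get the Spark 2.4 behavior (a Hive serde table) back:
spark.conf.set("spark.sql.legacy.createHiveTableByDefault.enabled", "true")
spark.sql("CREATE TABLE t_old (id INT, name STRING)")
```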
- - Since Spark 3.0, when inserting a value into a table column with a
different data type, the type coercion is performed as per ANSI SQL standard.
Certain unreasonable type conversions such as converting `string` to `int` and
`double` to `boolean` are disallowed. A runtime exception will be thrown if the
value is out-of-range for the data type of the column. In Spark version 2.4 and
earlier, type conversions during table insertion are allowed as long as they
are valid `Cast`. When inserting an out-of-range value to a integral field, the
low-order bits of the value is inserted(the same as Java/Scala numeric type
casting). For example, if 257 is inserted to a field of byte type, the result
is 1. The behavior is controlled by the option
`spark.sql.storeAssignmentPolicy`, with a default value as "ANSI". Setting the
option as "Legacy" restores the previous behavior.
+ - In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per the ANSI SQL standard. Certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean` are disallowed. A runtime exception is thrown if the value is out-of-range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are valid `Cast`s. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of byte type, the result is 1. The behavior is controlled by the option `spark.sql.storeAssignmentPolicy`, with a default value of "ANSI". Setting the option to "Legacy" restores the previous behavior.
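
A brief sketch of the byte overflow example above (the table name is hypothetical; assumes a `SparkSession` named `spark`):

```scala
spark.sql("CREATE TABLE bytes_tbl (b BYTE) USING parquet")

// Default ANSI policy in Spark 3.0: the out-of-range insert fails at runtime.
// spark.sql("INSERT INTO bytes_tbl VALUES (257)")

// Legacy policy keeps the low-order bits, as in Spark 2.4: 257 is stored as 1.
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO bytes_tbl VALUES (257)")
spark.sql("SELECT b FROM bytes_tbl").show()   // 1
```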
- The `ADD JAR` command previously returned a result set with the single
value 0. It now returns an empty result set.
- - In Spark version 2.4 and earlier, the `SET` command works without any
warnings even if the specified key is for `SparkConf` entries and it has no
effect because the command does not update `SparkConf`, but the behavior might
confuse users. Since 3.0, the command fails if a `SparkConf` key is used. You
can disable such a check by setting
`spark.sql.legacy.setCommandRejectsSparkCoreConfs` to `false`.
+ - In Spark version 2.4 and below, the `SET` command works without any warnings even if the specified key is for `SparkConf` entries, and it has no effect because the command does not update `SparkConf`; this behavior might confuse users. In Spark 3.0, the command fails if a `SparkConf` key is used. You can disable such a check by setting `spark.sql.legacy.setCommandRejectsSparkCoreConfs` to `false`.
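
A minimal sketch; the key used below is only illustrative:

```scala
// Spark 3.0 rejects SET on SparkConf (core) entries:
// spark.sql("SET spark.executor.memory=4g")   // AnalysisException

// Opt back into the silent Spark 2.4 behavior:
spark.conf.set("spark.sql.legacy.setCommandRejectsSparkCoreConfs", "false")
spark.sql("SET spark.executor.memory=4g")      // accepted, but still has no effect
```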
- - Refreshing a cached table would trigger a table uncache operation and then
a table cache (lazily) operation. In Spark version 2.4 and earlier, the cache
name and storage level are not preserved before the uncache operation.
Therefore, the cache name and storage level could be changed unexpectedly.
Since Spark 3.0, cache name and storage level will be first preserved for cache
recreation. It helps to maintain a consistent cache behavior upon table
refreshing.
+ - Refreshing a cached table triggers a table uncache operation and then a (lazy) table cache operation. In Spark version 2.4 and below, the cache name and storage level are not preserved before the uncache operation. Therefore, the cache name and storage level could be changed unexpectedly. In Spark 3.0, cache name and storage level are first preserved for cache recreation. It helps to maintain consistent cache behavior upon table refreshing.
- - Since Spark 3.0, the properties listing below become reserved, commands
will fail if we specify reserved properties in places like `CREATE DATABASE ...
WITH DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. We need their
specific clauses to specify them, e.g. `CREATE DATABASE test COMMENT 'any
comment' LOCATION 'some path'`. We can set
`spark.sql.legacy.notReserveProperties` to `true` to ignore the
`ParseException`, in this case, these properties will be silently removed, e.g
`SET DBPROTERTIES('location'='/tmp')` will affect nothing. In Spark version 2.4
and earlier, these properties are neither reserved nor have side effects, e.g.
`SET DBPROTERTIES('location'='/tmp')` will not change the location of the
database but only create a headless property just like `'a'='b'`.
- <table class="table">
- <tr>
- <th>
- <b>Property(case sensitive)</b>
- </th>
- <th>
- <b>Database Reserved</b>
- </th>
- <th>
- <b>Table Reserved</b>
- </th>
- <th>
- <b>Remarks</b>
- </th>
- </tr>
- <tr>
- <td>
- provider
- </td>
- <td>
- no
- </td>
- <td>
- yes
- </td>
- <td>
- For tables, please use the USING clause to specify it. Once set,
it can't be changed.
- </td>
- </tr>
- <tr>
- <td>
- location
- </td>
- <td>
- yes
- </td>
- <td>
- yes
- </td>
- <td>
- For databases and tables, please use the LOCATION clause to
specify it.
- </td>
- </tr>
- <tr>
- <td>
- owner
- </td>
- <td>
- yes
- </td>
- <td>
- yes
- </td>
- <td>
- For databases and tables, it is determined by the user who runs
spark and create the table.
- </td>
- </tr>
- </table>
+ - In Spark 3.0, the properties listed below become reserved; commands fail if you specify reserved properties in places like `CREATE DATABASE ... WITH DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. You need their specific clauses to specify them, for example, `CREATE DATABASE test COMMENT 'any comment' LOCATION 'some path'`. You can set `spark.sql.legacy.notReserveProperties` to `true` to ignore the `ParseException`; in this case, these properties are silently removed, for example, `SET DBPROPERTIES('location'='/tmp')` has no effect. In Spark version 2.4 and below, these properties are neither reserved nor have side effects, for example, `SET DBPROPERTIES('location'='/tmp')` does not change the location of the database but only creates a headless property just like `'a'='b'`.
+
+ | Property (case sensitive) | Database Reserved | Table Reserved | Remarks |
+ | ------------------------- | ----------------- | -------------- | ------- |
+ | provider | no | yes | For tables, use the `USING` clause to specify it. Once set, it can't be changed. |
+ | location | yes | yes | For databases and tables, use the `LOCATION` clause to specify it. |
+ | owner | yes | yes | For databases and tables, it is determined by the user who runs Spark and creates the table. |
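
A short sketch of the reserved-property rule (paths are made up; assumes a `SparkSession` named `spark`):

```scala
// Reserved properties go through their dedicated clauses in Spark 3.0:
spark.sql("CREATE DATABASE test COMMENT 'any comment' LOCATION '/tmp/test_db'")

// This fails with ParseException because 'location' is reserved:
// spark.sql("CREATE DATABASE test2 WITH DBPROPERTIES ('location'='/tmp/other')")
```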
- - Since Spark 3.0, `ADD FILE` can be used to add file directories as well.
Earlier only single files can be added using this command. To restore the
behaviour of earlier versions, set `spark.sql.legacy.addSingleFileInAddFile` to
`true`.
+
+ - In Spark 3.0, you can use `ADD FILE` to add file directories as well.
Earlier you could add only single files using this command. To restore the
behavior of earlier versions, set `spark.sql.legacy.addSingleFileInAddFile` to
`true`.
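
For instance (the directory path is hypothetical):

```scala
spark.sql("ADD FILE '/tmp/config_dir'")   // a directory: allowed in 3.0, rejected in 2.4

// Reject directories again, as in earlier versions:
spark.conf.set("spark.sql.legacy.addSingleFileInAddFile", "true")
```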
- - Since Spark 3.0, `SHOW TBLPROPERTIES` will cause `AnalysisException` if
the table does not exist. In Spark version 2.4 and earlier, this scenario
caused `NoSuchTableException`. Also, `SHOW TBLPROPERTIES` on a temporary view
will cause `AnalysisException`. In Spark version 2.4 and earlier, it returned
an empty result.
+ - In Spark 3.0, `SHOW TBLPROPERTIES` throws `AnalysisException` if the table
does not exist. In Spark version 2.4 and below, this scenario caused
`NoSuchTableException`. Also, `SHOW TBLPROPERTIES` on a temporary view causes
`AnalysisException`. In Spark version 2.4 and below, it returned an empty
result.
- - Since Spark 3.0, `SHOW CREATE TABLE` will always return Spark DDL, even
when the given table is a Hive serde table. For generating Hive DDL, please use
`SHOW CREATE TABLE AS SERDE` command instead.
+ - In Spark 3.0, `SHOW CREATE TABLE` always returns Spark DDL, even when the given table is a Hive SerDe table. For generating Hive DDL, use the `SHOW CREATE TABLE AS SERDE` command instead.
- - Since Spark 3.0, column of CHAR type is not allowed in non-Hive-Serde
tables, and CREATE/ALTER TABLE commands will fail if CHAR type is detected.
Please use STRING type instead. In Spark version 2.4 and earlier, CHAR type is
treated as STRING type and the length parameter is simply ignored.
+ - In Spark 3.0, a column of CHAR type is not allowed in non-Hive-SerDe tables, and CREATE/ALTER TABLE commands fail if CHAR type is detected. Use STRING type instead. In Spark version 2.4 and below, CHAR type is treated as STRING type and the length parameter is simply ignored.
### UDFs and Built-in Functions
- - Since Spark 3.0, the `date_add` and `date_sub` functions only accept int,
smallint, tinyint as the 2nd argument, fractional and non-literal string are
not valid anymore, e.g. `date_add(cast('1964-05-23' as date), 12.34)` will
cause `AnalysisException`. Note that, string literals are still allowed, but
Spark will throw Analysis Exception if the string content is not a valid
integer. In Spark version 2.4 and earlier, if the 2nd argument is fractional or
string value, it will be coerced to int value, and the result will be a date
value of `1964-06-04`.
+ - In Spark 3.0, the `date_add` and `date_sub` functions accept only int, smallint, and tinyint as the 2nd argument; fractional and non-literal strings are not valid anymore, for example, `date_add(cast('1964-05-23' as date), '12.34')` causes `AnalysisException`. Note that string literals are still allowed, but Spark throws `AnalysisException` if the string content is not a valid integer. In Spark version 2.4 and below, if the 2nd argument is a fractional or string value, it is coerced to an int value, and the result is a date value of `1964-06-04`.
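
A quick sketch of valid and invalid second arguments (assumes a `SparkSession` named `spark`):

```scala
// Integral (or a string literal holding a valid integer) is fine:
spark.sql("SELECT date_add(CAST('1964-05-23' AS DATE), 12)").show()   // 1964-06-04

// AnalysisException in Spark 3.0: fractional values are no longer coerced to int.
// spark.sql("SELECT date_add(CAST('1964-05-23' AS DATE), 12.34)")
```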
- - Since Spark 3.0, the function `percentile_approx` and its alias
`approx_percentile` only accept integral value with range in `[1, 2147483647]`
as its 3rd argument `accuracy`, fractional and string types are disallowed,
e.g. `percentile_approx(10.0, 0.2, 1.8D)` will cause `AnalysisException`. In
Spark version 2.4 and earlier, if `accuracy` is fractional or string value, it
will be coerced to an int value, `percentile_approx(10.0, 0.2, 1.8D)` is
operated as `percentile_approx(10.0, 0.2, 1)` which results in `10.0`.
+ - In Spark 3.0, the function `percentile_approx` and its alias `approx_percentile` accept only an integral value in the range `[1, 2147483647]` as the 3rd argument `accuracy`; fractional and string types are disallowed, for example, `percentile_approx(10.0, 0.2, 1.8D)` causes `AnalysisException`. In Spark version 2.4 and below, if `accuracy` is a fractional or string value, it is coerced to an int value, so `percentile_approx(10.0, 0.2, 1.8D)` is operated as `percentile_approx(10.0, 0.2, 1)`, which results in `10.0`.
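
For example, a brief sketch of the accepted and rejected accuracy arguments:

```scala
// Accuracy must be an integral value in [1, 2147483647]:
spark.sql("SELECT percentile_approx(10.0, 0.2, 100)").show()   // 10.0

// AnalysisException in Spark 3.0: fractional accuracy is rejected.
// spark.sql("SELECT percentile_approx(10.0, 0.2, 1.8D)")
```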
- - Since Spark 3.0, an analysis exception will be thrown when hash
expressions are applied on elements of MapType. To restore the behavior before
Spark 3.0, set `spark.sql.legacy.allowHashOnMapType` to `true`.
+ - In Spark 3.0, an analysis exception is thrown when hash expressions are
applied on elements of `MapType`. To restore the behavior before Spark 3.0, set
`spark.sql.legacy.allowHashOnMapType` to `true`.
- - Since Spark 3.0, when the `array`/`map` function is called without any
parameters, it returns an empty collection with `NullType` as element type. In
Spark version 2.4 and earlier, it returns an empty collection with `StringType`
as element type. To restore the behavior before Spark 3.0, you can set
`spark.sql.legacy.createEmptyCollectionUsingStringType` to `true`.
+ - In Spark 3.0, when the `array`/`map` function is called without any
parameters, it returns an empty collection with `NullType` as element type. In
Spark version 2.4 and below, it returns an empty collection with `StringType`
as element type. To restore the behavior before Spark 3.0, you can set
`spark.sql.legacy.createEmptyCollectionUsingStringType` to `true`.
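
A tiny sketch of the element-type change (assumes a `SparkSession` named `spark`):

```scala
// Spark 3.0: element type is NullType; Spark 2.4 used StringType.
spark.sql("SELECT array()").printSchema()

// Restore the Spark 2.4 behavior:
spark.conf.set("spark.sql.legacy.createEmptyCollectionUsingStringType", "true")
```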
- - Since Spark 3.0, the `from_json` functions supports two modes -
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json`
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing
of malformed JSON records. For example, the JSON string `{"a" 1}` with the
schema `a INT` is converted to `null` by previous versions but Spark 3.0
converts it to `Row(null)`.
+ - In Spark 3.0, the `from_json` function supports two modes: `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` did not conform to either `PERMISSIVE` or `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions, but Spark 3.0 converts it to `Row(null)`.
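
A sketch of the two modes on the malformed record above (assumes a `SparkSession` named `spark` and its implicits):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._

val schema = StructType(Seq(StructField("a", IntegerType)))
val df = Seq("""{"a" 1}""").toDF("json")        // malformed JSON record

// Default PERMISSIVE mode in Spark 3.0: the record becomes Row(null).
df.select(from_json($"json", schema)).show()

// FAILFAST raises an exception on the malformed record instead.
// df.select(from_json($"json", schema, Map("mode" -> "FAILFAST"))).show()
```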
- - In Spark version 2.4 and earlier, users can create map values with map
type key via built-in function such as `CreateMap`, `MapFromArrays`, etc. Since
Spark 3.0, it's not allowed to create map values with map type key with these
built-in functions. Users can use `map_entries` function to convert map to
array<struct<key, value>> as a workaround. In addition, users can still read
map values with map type key from data source or Java/Scala collections, though
it is discouraged.
+ - In Spark version 2.4 and below, you can create map values with a map type key via built-in functions such as `CreateMap`, `MapFromArrays`, etc. In Spark 3.0, creating map values with a map type key using these built-in functions is not allowed. You can use the `map_entries` function to convert a map to `array<struct<key, value>>` as a workaround. In addition, you can still read map values with a map type key from data sources or Java/Scala collections, though it is discouraged.
- - In Spark version 2.4 and earlier, users can create a map with duplicated
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior
of map with duplicated keys is undefined, e.g. map look up respects the
duplicated key appears first, `Dataset.collect` only keeps the duplicated key
appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, Spark
will throw RuntimeException while duplicated keys are found. Users can set
`spark.sql.mapKeyDedupPolicy` to LAST_WIN to deduplicate map keys with last
wins policy. Users may still read map values with duplicated keys from data
sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+ - In Spark version 2.4 and below, you can create a map with duplicated keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior of a map with duplicated keys is undefined, for example, map lookup respects the duplicated key that appears first, `Dataset.collect` only keeps the duplicated key that appears last, `MapKeys` returns duplicated keys, etc. In Spark 3.0, Spark throws `RuntimeException` when duplicated keys are found. You can set `spark.sql.mapKeyDedupPolicy` to `LAST_WIN` to deduplicate map keys with the last-wins policy. You may still read map values with duplicated keys from data sources which do not enforce it (for example, Parquet); the behavior is undefined.
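
For example, a minimal sketch (assumes a `SparkSession` named `spark`):

```scala
// Spark 3.0: duplicated map keys raise RuntimeException by default.
// spark.sql("SELECT map(1, 'a', 1, 'b')").show()

// Keep the last value for a duplicated key instead:
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b')").show()   // {1 -> b}
```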
- - Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef,
DataType)` is not allowed by default. Set
`spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. But please
note that, in Spark version 2.4 and earlier, if
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure
with primitive-type argument, the returned UDF will return null if the input
values is null. However, since Spark 3.0, the UDF will return the default value
of the Java type if the input value is null. For example, `val f = udf((x: Int)
=> x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if
column `x` is null, and return 0 in Spark 3.0. This behavior change is
introduced because Spark 3.0 is built with Scala 2.12 by default.
+ - In Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to `true` to keep using it. In Spark version 2.4 and below, if `org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure with a primitive-type argument, the returned UDF returns null if the input value is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, with `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns null in Spark 2.4 and below if column `x` is null, and returns 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
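
A runnable sketch of the inline example, assuming a `SparkSession` named `spark` and its implicits:

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// The untyped variant must be re-enabled explicitly in Spark 3.0.
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")
val f = udf((x: Int) => x, IntegerType)

val df = Seq(Some(1), None).toDF("x")
df.select(f($"x")).show()   // the null row yields null in Spark 2.4 but 0 in Spark 3.0
```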
- - Since Spark 3.0, a higher-order function `exists` follows the three-valued
boolean logic, i.e., if the `predicate` returns any `null`s and no `true` is
obtained, then `exists` will return `null` instead of `false`. For example,
`exists(array(1, null, 3), x -> x % 2 == 0)` will be `null`. The previous
behaviour can be restored by setting
`spark.sql.legacy.followThreeValuedLogicInArrayExists` to `false`.
+ - In Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic, that is, if the `predicate` returns any `null`s and no `true` is obtained, then `exists` returns `null` instead of `false`. For example, `exists(array(1, null, 3), x -> x % 2 == 0)` is `null`. The previous behavior can be restored by setting `spark.sql.legacy.followThreeValuedLogicInArrayExists` to `false`.
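
For instance, a small sketch (assumes a `SparkSession` named `spark`):

```scala
// Three-valued logic in Spark 3.0: the null element yields null, no element yields true.
spark.sql("SELECT exists(array(1, null, 3), x -> x % 2 == 0)").show()   // null

// Restore the pre-3.0 behavior (null treated as false):
spark.conf.set("spark.sql.legacy.followThreeValuedLogicInArrayExists", "false")
```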
- - Since Spark 3.0, the `add_months` function does not adjust the resulting
date to a last day of month if the original date is a last day of months. For
example, `select add_months(DATE'2019-02-28', 1)` results `2019-03-28`. In
Spark version 2.4 and earlier, the resulting date is adjusted when the original
date is a last day of months. For example, adding a month to `2019-02-28`
results in `2019-03-31`.
+ - In Spark 3.0, the `add_months` function does not adjust the resulting date to the last day of the month if the original date is the last day of a month. For example, `select add_months(DATE'2019-02-28', 1)` results in `2019-03-28`. In Spark version 2.4 and below, the resulting date is adjusted when the original date is the last day of a month. For example, adding a month to `2019-02-28` results in `2019-03-31`.
- - In Spark version 2.4 and earlier, the `current_timestamp` function returns
a timestamp with millisecond resolution only. Since Spark 3.0, the function can
return the result with microsecond resolution if the underlying clock available
on the system offers such resolution.
+ - In Spark version 2.4 and below, the `current_timestamp` function returns a
timestamp with millisecond resolution only. In Spark 3.0, the function can
return the result with microsecond resolution if the underlying clock available
on the system offers such resolution.
- - Since Spark 3.0, 0-argument Java UDF is executed in the executor side
identically with other UDFs. In Spark version 2.4 and earlier, 0-argument Java
UDF alone was executed in the driver side, and the result was propagated to
executors, which might be more performant in some cases but caused
inconsistency with a correctness issue in some cases.
+ - In Spark 3.0, a 0-argument Java UDF is executed in the executor side
identically with other UDFs. In Spark version 2.4 and below, 0-argument Java
UDF alone was executed in the driver side, and the result was propagated to
executors, which might be more performant in some cases but caused
inconsistency with a correctness issue in some cases.
Review comment:
`a` or `the` for the second `0-argument Java UDF` because we add `a` for the
first one?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]