[GitHub] [spark] gatorsmile commented on a change in pull request #28125: [SPARK-31351][DOC] Migration Guide Auditing for Spark 3.0 Release

GitBox Tue, 07 Apr 2020 15:02:10 -0700

gatorsmile commented on a change in pull request #28125: [SPARK-31351][DOC] 
Migration Guide Auditing for Spark 3.0 Release
URL: https://github.com/apache/spark/pull/28125#discussion_r405139431


 ##########
 File path: docs/sql-migration-guide.md
 ##########
 @@ -24,335 +24,195 @@ license: |
 
 ## Upgrading from Spark SQL 3.0 to 3.1
 
-  - Since Spark 3.1, grouping_id() returns long values. In Spark version 3.0 
and earlier, this function returns int values. To restore the behavior before 
Spark 3.0, you can set `spark.sql.legacy.integerGroupingId` to `true`.
+  - In Spark 3.1, grouping_id() returns long values. In Spark version 3.0 and 
earlier, this function returns int values. To restore the behavior before Spark 
3.0, you can set `spark.sql.legacy.integerGroupingId` to `true`.
 
-  - Since Spark 3.1, SQL UI data adopts the `formatted` mode for the query 
plan explain results. To restore the behavior before Spark 3.0, you can set 
`spark.sql.ui.explainMode` to `extended`.
+  - In Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan 
explain results. To restore the behavior before Spark 3.0, you can set 
`spark.sql.ui.explainMode` to `extended`.
 
 ## Upgrading from Spark SQL 2.4 to 3.0
 
 ### Dataset/DataFrame APIs
 
-  - Since Spark 3.0, the Dataset and DataFrame API `unionAll` is not 
deprecated any more. It is an alias for `union`.
+  - In Spark 3.0, the Dataset and DataFrame API `unionAll` is no longer 
deprecated. It is an alias for `union`.
 
-  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a 
grouped dataset with key attribute wrongly named as "value", if the key is 
non-struct type, e.g. int, string, array, etc. This is counterintuitive and 
makes the schema of aggregation queries weird. For example, the schema of 
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the 
grouping attribute to "key". The old behaviour is preserved under a newly added 
configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a 
default value of `false`.
+  - In Spark 2.4 and below, `Dataset.groupByKey` results to a grouped dataset 
with key attribute is wrongly named as "value", if the key is non-struct type, 
for example, int, string, array, etc. This is counterintuitive and makes the 
schema of aggregation queries unexpected. For example, the schema of 
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the 
grouping attribute to "key". The old behavior is preserved under a newly added 
configuration `spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue` with a 
default value of `false`.
 
 ### DDL Statements
 
-  - Since Spark 3.0, `CREATE TABLE` without a specific provider will use the 
value of `spark.sql.sources.default` as its provider. In Spark version 2.4 and 
earlier, it was hive. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
+  - In Spark 3.0, `CREATE TABLE` without a specific provider uses the value of 
`spark.sql.sources.default` as its provider. In Spark version 2.4 and below, it 
was Hive. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
 
-  - Since Spark 3.0, when inserting a value into a table column with a 
different data type, the type coercion is performed as per ANSI SQL standard. 
Certain unreasonable type conversions such as converting `string` to `int` and 
`double` to `boolean` are disallowed. A runtime exception will be thrown if the 
value is out-of-range for the data type of the column. In Spark version 2.4 and 
earlier, type conversions during table insertion are allowed as long as they 
are valid `Cast`. When inserting an out-of-range value to a integral field, the 
low-order bits of the value is inserted(the same as Java/Scala numeric type 
casting). For example, if 257 is inserted to a field of byte type, the result 
is 1. The behavior is controlled by the option 
`spark.sql.storeAssignmentPolicy`, with a default value as "ANSI". Setting the 
option as "Legacy" restores the previous behavior.
+  - In Spark 3.0, when inserting a value into a table column with a different 
data type, the type coercion is performed as per ANSI SQL standard. Certain 
unreasonable type conversions such as converting `string` to `int` and `double` 
to `boolean` are disallowed. A runtime exception is thrown if the value is 
out-of-range for the data type of the column. In Spark version 2.4 and below, 
type conversions during table insertion are allowed as long as they are valid 
`Cast`. When inserting an out-of-range value to a integral field, the low-order 
bits of the value is inserted(the same as Java/Scala numeric type casting). For 
example, if 257 is inserted to a field of byte type, the result is 1. The 
behavior is controlled by the option `spark.sql.storeAssignmentPolicy`, with a 
default value as "ANSI". Setting the option as "Legacy" restores the previous 
behavior.
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
-  - In Spark version 2.4 and earlier, the `SET` command works without any 
warnings even if the specified key is for `SparkConf` entries and it has no 
effect because the command does not update `SparkConf`, but the behavior might 
confuse users. Since 3.0, the command fails if a `SparkConf` key is used. You 
can disable such a check by setting 
`spark.sql.legacy.setCommandRejectsSparkCoreConfs` to `false`.
+  - Spark 2.4 and below: the `SET` command works without any warnings even if 
the specified key is for `SparkConf` entries and it has no effect because the 
command does not update `SparkConf`, but the behavior might confuse users. In 
3.0, the command fails if a `SparkConf` key is used. You can disable such a 
check by setting `spark.sql.legacy.setCommandRejectsSparkCoreConfs` to `false`.
 
-  - Refreshing a cached table would trigger a table uncache operation and then 
a table cache (lazily) operation. In Spark version 2.4 and earlier, the cache 
name and storage level are not preserved before the uncache operation. 
Therefore, the cache name and storage level could be changed unexpectedly. 
Since Spark 3.0, cache name and storage level will be first preserved for cache 
recreation. It helps to maintain a consistent cache behavior upon table 
refreshing.
+  - Refreshing a cached table would trigger a table uncache operation and then 
a table cache (lazily) operation. In Spark version 2.4 and below, the cache 
name and storage level are not preserved before the uncache operation. 
Therefore, the cache name and storage level could be changed unexpectedly. In 
Spark 3.0, cache name and storage level are first preserved for cache 
recreation. It helps to maintain a consistent cache behavior upon table 
refreshing.
 
-  - Since Spark 3.0, the properties listing below become reserved, commands 
will fail if we specify reserved properties in places like `CREATE DATABASE ... 
WITH DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. We need their 
specific clauses to specify them, e.g. `CREATE DATABASE test COMMENT 'any 
comment' LOCATION 'some path'`. We can set 
`spark.sql.legacy.notReserveProperties` to `true` to ignore the 
`ParseException`, in this case, these properties will be silently removed, e.g 
`SET DBPROTERTIES('location'='/tmp')` will affect nothing. In Spark version 2.4 
and earlier, these properties are neither reserved nor have side effects, e.g. 
`SET DBPROTERTIES('location'='/tmp')` will not change the location of the 
database but only create a headless property just like `'a'='b'`.
-    <table class="table">
-        <tr>
-          <th>
-            <b>Property(case sensitive)</b>
-          </th>
-          <th>
-            <b>Database Reserved</b>
-          </th>
-          <th>
-            <b>Table Reserved</b>
-          </th>
-          <th>
-            <b>Remarks</b>
-          </th>
-        </tr>
-        <tr>
-          <td>
-            provider
-          </td>
-          <td>
-            no
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            For tables, please use the USING clause to specify it. Once set, 
it can't be changed.
-          </td>
-        </tr>
-        <tr>
-          <td>
-            location
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            For databases and tables, please use the LOCATION clause to 
specify it.
-          </td>
-        </tr>
-        <tr>
-          <td>
-            owner
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            For databases and tables, it is determined by the user who runs 
spark and create the table.
-          </td>
-        </tr>
-    </table>
+  - In Spark 3.0, the properties listing below become reserved; commands fail 
if you specify reserved properties in places like `CREATE DATABASE ... WITH 
DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. You need their specific 
clauses to specify them, for example, `CREATE DATABASE test COMMENT 'any 
comment' LOCATION 'some path'`. You can set 
`spark.sql.legacy.notReserveProperties` to `true` to ignore the 
`ParseException`, in this case, these properties will be silently removed, for 
example: `SET DBPROTERTIES('location'='/tmp')` will have no effect. In Spark 
version 2.4 and below, these properties are neither reserved nor have side 
effects, for example, `SET DBPROTERTIES('location'='/tmp')` do not change the 
location of the database but only create a headless property just like 
`'a'='b'`.
+
+    | Property (case sensitive) | Database Reserved | Table Reserved | Remarks 
|
+    | ------------------------- | ----------------- | -------------- | ------- 
|
+    | provider                 | no                | yes            | For 
tables, use the `USING` clause to specify it. Once set, it can't be changed. |
+    | location                 | yes               | yes            | For 
databases and tables, use the `LOCATION` clause to specify it. |
+    | owner                    | yes               | yes            | For 
databases and tables, it is determined by the user who runs spark and create 
the table. |
 
-  - Since Spark 3.0, `ADD FILE` can be used to add file directories as well. 
Earlier only single files can be added using this command. To restore the 
behaviour of earlier versions, set `spark.sql.legacy.addSingleFileInAddFile` to 
`true`.
+ 
+  - In Spark 3.0, you can use `ADD FILE` to add file directories as well. 
Earlier you could add only single files using this command. To restore the 
behavior of earlier versions, set `spark.sql.legacy.addSingleFileInAddFile` to 
`true`.
 
-  - Since Spark 3.0, `SHOW TBLPROPERTIES` will cause `AnalysisException` if 
the table does not exist. In Spark version 2.4 and earlier, this scenario 
caused `NoSuchTableException`. Also, `SHOW TBLPROPERTIES` on a temporary view 
will cause `AnalysisException`. In Spark version 2.4 and earlier, it returned 
an empty result.
+  - In Spark 3.0, `SHOW TBLPROPERTIES` throws `AnalysisException` if the table 
does not exist. In Spark version 2.4 and below, this scenario caused 
`NoSuchTableException`. Also, `SHOW TBLPROPERTIES` on a temporary view causes 
`AnalysisException`. In Spark version 2.4 and below, it returned an empty 
result.
 
-  - Since Spark 3.0, `SHOW CREATE TABLE` will always return Spark DDL, even 
when the given table is a Hive serde table. For generating Hive DDL, please use 
`SHOW CREATE TABLE AS SERDE` command instead.
+  - In Spark 3.0, `SHOW CREATE TABLE` always returns Spark DDL, even when the 
given table is a Hive SerDe table. For generating Hive DDL, use `SHOW CREATE 
TABLE AS SERDE` command instead.
 
-  - Since Spark 3.0, column of CHAR type is not allowed in non-Hive-Serde 
tables, and CREATE/ALTER TABLE commands will fail if CHAR type is detected. 
Please use STRING type instead. In Spark version 2.4 and earlier, CHAR type is 
treated as STRING type and the length parameter is simply ignored.
+  - In Spark 3.0, column of CHAR type is not allowed in non-Hive-Serde tables, 
and CREATE/ALTER TABLE commands will fail if CHAR type is detected. Please use 
STRING type instead. In Spark version 2.4 and below, CHAR type is treated as 
STRING type and the length parameter is simply ignored.
 
 ### UDFs and Built-in Functions
 
-  - Since Spark 3.0, the `date_add` and `date_sub` functions only accept int, 
smallint, tinyint as the 2nd argument, fractional and non-literal string are 
not valid anymore, e.g. `date_add(cast('1964-05-23' as date), 12.34)` will 
cause `AnalysisException`. Note that, string literals are still allowed, but 
Spark will throw Analysis Exception if the string content is not a valid 
integer. In Spark version 2.4 and earlier, if the 2nd argument is fractional or 
string value, it will be coerced to int value, and the result will be a date 
value of `1964-06-04`.
+  - In Spark 3.0, the `date_add` and `date_sub` functions accepts only int, 
smallint, tinyint as the 2nd argument; fractional and non-literal strings are 
not valid anymore, for example: `date_add(cast('1964-05-23' as date), '12.34')` 
causes `AnalysisException`. Note that, string literals are still allowed, but 
Spark will throw `AnalysisException` if the string content is not a valid 
integer. In Spark version 2.4 and below, if the 2nd argument is fractional or 
string value, it is coerced to int value, and the result is a date value of 
`1964-06-04`.
 
-  - Since Spark 3.0, the function `percentile_approx` and its alias 
`approx_percentile` only accept integral value with range in `[1, 2147483647]` 
as its 3rd argument `accuracy`, fractional and string types are disallowed, 
e.g. `percentile_approx(10.0, 0.2, 1.8D)` will cause `AnalysisException`. In 
Spark version 2.4 and earlier, if `accuracy` is fractional or string value, it 
will be coerced to an int value, `percentile_approx(10.0, 0.2, 1.8D)` is 
operated as `percentile_approx(10.0, 0.2, 1)` which results in `10.0`.
+  - In Spark 3.0, the function `percentile_approx` and its alias 
`approx_percentile` only accept integral value with range in `[1, 2147483647]` 
as its 3rd argument `accuracy`, fractional and string types are disallowed, for 
example, `percentile_approx(10.0, 0.2, 1.8D)` causes `AnalysisException`. In 
Spark version 2.4 and below, if `accuracy` is fractional or string value, it is 
coerced to an int value, `percentile_approx(10.0, 0.2, 1.8D)` is operated as 
`percentile_approx(10.0, 0.2, 1)` which results in `10.0`.
 
-  - Since Spark 3.0, an analysis exception will be thrown when hash 
expressions are applied on elements of MapType. To restore the behavior before 
Spark 3.0, set `spark.sql.legacy.allowHashOnMapType` to `true`.
+  - In Spark 3.0, an analysis exception is thrown when hash expressions are 
applied on elements of `MapType`. To restore the behavior before Spark 3.0, set 
`spark.sql.legacy.allowHashOnMapType` to `true`.
 
-  - Since Spark 3.0, when the `array`/`map` function is called without any 
parameters, it returns an empty collection with `NullType` as element type. In 
Spark version 2.4 and earlier, it returns an empty collection with `StringType` 
as element type. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createEmptyCollectionUsingStringType` to `true`.
+  - In Spark 3.0, when the `array`/`map` function is called without any 
parameters, it returns an empty collection with `NullType` as element type. In 
Spark version 2.4 and below, it returns an empty collection with `StringType` 
as element type. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createEmptyCollectionUsingStringType` to `true`.
 
-  - Since Spark 3.0, the `from_json` functions supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` 
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
+  - In Spark 3.0, the `from_json` functions supports two modes - `PERMISSIVE` 
and `FAILFAST`. The modes can be set via the `mode` option. The default mode 
became `PERMISSIVE`. In previous versions, behavior of `from_json` did not 
conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing of 
malformed JSON records. For example, the JSON string `{"a" 1}` with the schema 
`a INT` is converted to `null` by previous versions but Spark 3.0 converts it 
to `Row(null)`.
 
-  - In Spark version 2.4 and earlier, users can create map values with map 
type key via built-in function such as `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can use `map_entries` function to convert map to 
array<struct<key, value>> as a workaround. In addition, users can still read 
map values with map type key from data source or Java/Scala collections, though 
it is discouraged.
+  - In Spark version 2.4 and below, you can create map values with map type 
key via built-in function such as `CreateMap`, `MapFromArrays`, etc. In Spark 
3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can use `map_entries` function to convert map to 
array<struct<key, value>> as a workaround. In addition, users can still read 
map values with map type key from data source or Java/Scala collections, though 
it is discouraged.
 
-  - In Spark version 2.4 and earlier, users can create a map with duplicated 
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior 
of map with duplicated keys is undefined, e.g. map look up respects the 
duplicated key appears first, `Dataset.collect` only keeps the duplicated key 
appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, Spark 
will throw RuntimeException while duplicated keys are found. Users can set 
`spark.sql.mapKeyDedupPolicy` to LAST_WIN to deduplicate map keys with last 
wins policy. Users may still read map values with duplicated keys from data 
sources which do not enforce it (e.g. Parquet), the behavior will be undefined.
+  - In In Spark version 2.4 and below, you  can create a map with duplicated 
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior 
of map with duplicated keys is undefined, for example, map look up respects the 
duplicated key appears first, `Dataset.collect` only keeps the duplicated key 
appears last, `MapKeys` returns duplicated keys, etc. In Spark 3.0, Spark 
throws `RuntimeException` when duplicated keys are found. You can set 
`spark.sql.mapKeyDedupPolicy` to `LAST_WIN` to deduplicate map keys with last 
wins policy. Users may still read map values with duplicated keys from data 
sources which do not enforce it (for example, Parquet), the behavior is 
undefined.
 
-  - Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, 
DataType)` is not allowed by default. Set 
`spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. But please 
note that, in Spark version 2.4 and earlier, if 
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure 
with primitive-type argument, the returned UDF will return null if the input 
values is null. However, since Spark 3.0, the UDF will return the default value 
of the Java type if the input value is null. For example, `val f = udf((x: Int) 
=> x, IntegerType)`, `f($"x")` will return null in Spark 2.4 and earlier if 
column `x` is null, and return 0 in Spark 3.0. This behavior change is 
introduced because Spark 3.0 is built with Scala 2.12 by default.
+  - In Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, DataType)` 
is not allowed by default. Set `spark.sql.legacy.allowUntypedScalaUDF` to true 
to keep using it. In Spark version 2.4 and below, if 
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure 
with primitive-type argument, the returned UDF returns null if the input values 
is null. However, in Spark 3.0, the UDF returns the default value of the Java 
type if the input value is null. For example, `val f = udf((x: Int) => x, 
IntegerType)`, `f($"x")` returns null in Spark 2.4 and below if column `x` is 
null, and return 0 in Spark 3.0. This behavior change is introduced because 
Spark 3.0 is built with Scala 2.12 by default.
 
-  - Since Spark 3.0, a higher-order function `exists` follows the three-valued 
boolean logic, i.e., if the `predicate` returns any `null`s and no `true` is 
obtained, then `exists` will return `null` instead of `false`. For example, 
`exists(array(1, null, 3), x -> x % 2 == 0)` will be `null`. The previous 
behaviour can be restored by setting 
`spark.sql.legacy.followThreeValuedLogicInArrayExists` to `false`.
+  - In Spark 3.0, a higher-order function `exists` follows the three-valued 
boolean logic, that is, if the `predicate` returns any `null`s and no `true` is 
obtained, then `exists` returns `null` instead of `false`. For example, 
`exists(array(1, null, 3), x -> x % 2 == 0)` is `null`. The previous 
behaviorcan be restored by setting 
`spark.sql.legacy.followThreeValuedLogicInArrayExists` to `false`.
 
-  - Since Spark 3.0, the `add_months` function does not adjust the resulting 
date to a last day of month if the original date is a last day of months. For 
example, `select add_months(DATE'2019-02-28', 1)` results `2019-03-28`. In 
Spark version 2.4 and earlier, the resulting date is adjusted when the original 
date is a last day of months. For example, adding a month to `2019-02-28` 
results in `2019-03-31`.
+  - In Spark 3.0, the `add_months` function does not adjust the resulting date 
to a last day of month if the original date is a last day of months. For 
example, `select add_months(DATE'2019-02-28', 1)` results `2019-03-28`. In 
Spark version 2.4 and below, the resulting date is adjusted when the original 
date is a last day of months. For example, adding a month to `2019-02-28` 
results in `2019-03-31`.
 
-  - In Spark version 2.4 and earlier, the `current_timestamp` function returns 
a timestamp with millisecond resolution only. Since Spark 3.0, the function can 
return the result with microsecond resolution if the underlying clock available 
on the system offers such resolution.
+  - In Spark version 2.4 and below, the `current_timestamp` function returns a 
timestamp with millisecond resolution only. In Spark 3.0, the function can 
return the result with microsecond resolution if the underlying clock available 
on the system offers such resolution.
 
-  - Since Spark 3.0, 0-argument Java UDF is executed in the executor side 
identically with other UDFs. In Spark version 2.4 and earlier, 0-argument Java 
UDF alone was executed in the driver side, and the result was propagated to 
executors, which might be more performant in some cases but caused 
inconsistency with a correctness issue in some cases.
+  - In Spark 3.0, a 0-argument Java UDF is executed in the executor side 
identically with other UDFs. In Spark version 2.4 and below, 0-argument Java 
UDF alone was executed in the driver side, and the result was propagated to 
executors, which might be more performant in some cases but caused 
inconsistency with a correctness issue in some cases.
 
   - The result of `java.lang.Math`'s `log`, `log1p`, `exp`, `expm1`, and `pow` 
may vary across platforms. In Spark 3.0, the result of the equivalent SQL 
functions (including related SQL functions like `LOG10`) return values 
consistent with `java.lang.StrictMath`. In virtually all cases this makes no 
difference in the return value, and the difference is very small, but may not 
exactly match `java.lang.Math` on x86 platforms in cases like, for example, 
`log(3.0)`, whose value varies between `Math.log()` and `StrictMath.log()`.
 
-  - Since Spark 3.0, `Cast` function processes string literals such as 
'Infinity', '+Infinity', '-Infinity', 'NaN', 'Inf', '+Inf', '-Inf' in case 
insensitive manner when casting the literals to `Double` or `Float` type to 
ensure greater compatibility with other database systems. This behaviour change 
is illustrated in the table below:
-    <table class="table">
-        <tr>
-          <th>
-            <b>Operation</b>
-          </th>
-          <th>
-            <b>Result prior to Spark 3.0</b>
-          </th>
-          <th>
-            <b>Result starting Spark 3.0</b>
-          </th>
-        </tr>
-        <tr>
-          <td>
-            CAST('infinity' AS DOUBLE)<br>
-            CAST('+infinity' AS DOUBLE)<br>
-            CAST('inf' AS DOUBLE)<br>
-            CAST('+inf' AS DOUBLE)<br>
-          </td>
-          <td>
-            NULL
-          </td>
-          <td>
-            Double.PositiveInfinity
-          </td>
-        </tr>
-        <tr>
-          <td>
-            CAST('-infinity' AS DOUBLE)<br>
-            CAST('-inf' AS DOUBLE)<br>
-          </td>
-          <td>
-            NULL
-          </td>
-          <td>
-            Double.NegativeInfinity
-          </td>
-        </tr>
-        <tr>
-          <td>
-            CAST('infinity' AS FLOAT)<br>
-            CAST('+infinity' AS FLOAT)<br>
-            CAST('inf' AS FLOAT)<br>
-            CAST('+inf' AS FLOAT)<br>
-          </td>
-          <td>
-            NULL
-          </td>
-          <td>
-            Float.PositiveInfinity
-          </td>
-        </tr>
-        <tr>
-          <td>
-            CAST('-infinity' AS FLOAT)<br>
-            CAST('-inf' AS FLOAT)<br>
-          </td>
-          <td>
-            NULL
-          </td>
-          <td>
-            Float.NegativeInfinity
-          </td>
-        </tr>
-        <tr>
-          <td>
-            CAST('nan' AS DOUBLE)
-          </td>
-          <td>
-            NULL
-          </td>
-          <td>
-            Double.NaN
-          </td>
-        </tr>
-        <tr>
-          <td>
-            CAST('nan' AS FLOAT)
-          </td>
-          <td>
-            NULL
-          </td>
-          <td>
-            Float.NaN
-          </td>
-        </tr>
-    </table>
+  - In Spark 3.0, the `Cast` function processes string literals such as 
'Infinity', '+Infinity', '-Infinity', 'NaN', 'Inf', '+Inf', '-Inf' in a 
case-insensitive manner when casting the literals to `Double` or `Float` type 
to ensure greater compatibility with other database systems. This behavior 
change is illustrated in the table below:
+
+    | Operation | Result before Spark 3.0 | Result in Spark 3.0 |
+    | --------- | ----------------------- | ------------------- |
+    | CAST('infinity' AS DOUBLE) | NULL | Double.PositiveInfinity |
+    | CAST('+infinity' AS DOUBLE) | NULL | Double.PositiveInfinity |
+    | CAST('inf' AS DOUBLE) | NULL | Double.PositiveInfinity |
+    | CAST('inf' AS DOUBLE) | NULL | Double.PositiveInfinity |
+    | CAST('-infinity' AS DOUBLE) | NULL | Double.NegativeInfinity |
+    | CAST('-inf' AS DOUBLE) | NULL | Double.NegativeInfinity |
+    | CAST('infinity' AS FLOAT) | NULL | Float.PositiveInfinity |
+    | CAST('+infinity' AS FLOAT) | NULL | Float.PositiveInfinity |
+    | CAST('inf' AS FLOAT) | NULL | Float.PositiveInfinity |
+    | CAST('+inf' AS FLOAT) | NULL | Float.PositiveInfinity |
+    | CAST('-infinity' AS FLOAT) | NULL | Float.NegativeInfinity |
+    | CAST('-inf' AS FLOAT) | NULL | Float.NegativeInfinity |
+    | CAST('nan' AS DOUBLE) | NULL | Double.Nan |
+    | CAST('nan' AS FLOAT) | NULL | Float.NaN |
+
+  - In Spark 3.0, when casting interval values to string type, there is no 
"interval" prefix, for example, `1 days 2 hours`. In Spark version 2.4 and 
below, the string contains the "interval" prefix like `interval 1 days 2 hours`.
+
+  - In Spark 3.0, when casting string value to integral types(tinyint, 
smallint, int and bigint), datetime types(date, timestamp and interval) and 
boolean type, the leading and trailing whitespaces (<= ASCII 32) will be 
trimmed before converted to these type values, for example, `cast(' 1\t' as 
int)` results `1`, `cast(' 1\t' as boolean)` results `true`, 
`cast('2019-10-10\t as date)` results the date value `2019-10-10`. In Spark 
version 2.4 and below, when casting string to integrals and booleans, it does 
not trim the whitespaces from both ends; the foregoing results is `null`, while 
to datetimes, only the trailing spaces (= ASCII 32) are removed.
 
-  - Since Spark 3.0, when casting interval values to string type, there is no 
"interval" prefix, e.g. `1 days 2 hours`. In Spark version 2.4 and earlier, the 
string contains the "interval" prefix like `interval 1 days 2 hours`.
+### Query Engine
 
-  - Since Spark 3.0, when casting string value to integral types(tinyint, 
smallint, int and bigint), datetime types(date, timestamp and interval) and 
boolean type, the leading and trailing whitespaces (<= ASCII 32) will be 
trimmed before converted to these type values, e.g. `cast(' 1\t' as int)` 
results `1`, `cast(' 1\t' as boolean)` results `true`, `cast('2019-10-10\t as 
date)` results the date value `2019-10-10`. In Spark version 2.4 and earlier, 
while casting string to integrals and booleans, it will not trim the 
whitespaces from both ends, the foregoing results will be `null`, while to 
datetimes, only the trailing spaces (= ASCII 32) will be removed.
+  - In Spark version 2.4 and below, SQL queries such as `FROM <table>` or 
`FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style 
`FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither 
Hive nor Presto support this syntax. These queries are treated as invalid in 
Spark 3.0.
 
-### Query Engine
+  - In Spark 3.0, the interval literal syntax does not allow multiple from-to 
units anymore. For example, `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO 
MONTH'` throws parser exception.
 
-  - In Spark version 2.4 and earlier, SQL queries such as `FROM <table>` or 
`FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style 
`FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither 
Hive nor Presto support this syntax. Therefore we will treat these queries as 
invalid since Spark 3.0.
+  - In Spark 3.0, numbers written in scientific notation(for example, `1E2`) 
would be parsed as Double. In Spark version 2.4 and below, they're parsed as 
Decimal. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.exponentLiteralAsDecimal.enabled` to `true`.
 
-  - Since Spark 3.0, the interval literal syntax does not allow multiple 
from-to units anymore. For example, `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' 
YEAR TO MONTH'` throws parser exception.
+  - In Spark 3.0, day-time interval strings are converted to intervals with 
respect to the `from` and `to` bounds. If an input string does not match to the 
pattern defined by specified bounds, the `ParseException` exception is thrown. 
For example, `interval '2 10:20' hour to minute` raises the exception because 
the expected format is `[+|-]h[h]:[m]m`. In Spark version 2.4, the `from` bound 
was not taken into account, and the `to` bound was used to truncate the 
resulted interval. For instance, the day-time interval string from the showed 
example is converted to `interval 10 hours 20 minutes`. To restore the behavior 
before Spark 3.0, you can set `spark.sql.legacy.fromDayTimeString.enabled` to 
`true`.
 
-  - Since Spark 3.0, numbers written in scientific notation(e.g. `1E2`) would 
be parsed as Double. In Spark version 2.4 and earlier, they're parsed as 
Decimal. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.exponentLiteralAsDecimal.enabled` to `true`.
+  - In Spark 3.0, negative scale of decimal is not allowed by default, for 
example, data type of literal like `1E10BD` is `DecimalType(11, 0)`. In Spark 
version 2.4 and below, it was `DecimalType(2, -9)`. To restore the behavior 
before Spark 3.0, you can set `spark.sql.legacy.allowNegativeScaleOfDecimal` to 
`true`.
 
-  - Since Spark 3.0, day-time interval strings are converted to intervals with 
respect to the `from` and `to` bounds. If an input string does not match to the 
pattern defined by specified bounds, the `ParseException` exception is thrown. 
For example, `interval '2 10:20' hour to minute` raises the exception because 
the expected format is `[+|-]h[h]:[m]m`. In Spark version 2.4, the `from` bound 
was not taken into account, and the `to` bound was used to truncate the 
resulted interval. For instance, the day-time interval string from the showed 
example is converted to `interval 10 hours 20 minutes`. To restore the behavior 
before Spark 3.0, you can set `spark.sql.legacy.fromDayTimeString.enabled` to 
`true`.
-  
-  - Since Spark 3.0, negative scale of decimal is not allowed by default, e.g. 
data type of literal like `1E10BD` is `DecimalType(11, 0)`. In Spark version 
2.4 and earlier, it was `DecimalType(2, -9)`. To restore the behavior before 
Spark 3.0, you can set `spark.sql.legacy.allowNegativeScaleOfDecimal` to `true`.
+  - In Spark 3.0, the unary arithmetic operator plus(`+`) only accepts string, 
numeric and interval type values as inputs. Besides, `+` with a integral string 
representation is coerced to a double value, for example, `+'1'` returns `1.0`. 
In Spark version 2.4 and below, this operator is ignored. There is no type 
checking for it, thus, all type values with a `+` prefix are valid, for 
example, `+ array(1, 2)` is valid and results `[1, 2]`. Besides, there is no 
type coercion for it at all, for example, in Spark 2.4, the result of `+'1'` is 
string `1`.
 
-  - Since Spark 3.0, the unary arithmetic operator plus(`+`) only accepts 
string, numeric and interval type values as inputs. Besides, `+` with a 
integral string representation will be coerced to double value, e.g. `+'1'` 
results `1.0`. In Spark version 2.4 and earlier, this operator is ignored. 
There is no type checking for it, thus, all type values with a `+` prefix are 
valid, e.g. `+ array(1, 2)` is valid and results `[1, 2]`. Besides, there is no 
type coercion for it at all, e.g. in Spark 2.4, the result of `+'1'` is string 
`1`.
+  - In Spark 3.0, Dataset query fails if it contains ambiguous column 
reference that is caused by self join. A typical example: `val df1 = ...; val 
df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an 
empty result which is quite confusing. This is because Spark cannot resolve 
Dataset column references that point to tables being self joined, and 
`df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior 
before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to 
`false`.
 
-  - Since Spark 3.0, Dataset query fails if it contains ambiguous column 
reference that is caused by self join. A typical example: `val df1 = ...; val 
df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an 
empty result which is quite confusing. This is because Spark cannot resolve 
Dataset column references that point to tables being self joined, and 
`df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior 
before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to 
`false`.
+  - In Spark 3.0, `spark.sql.legacy.ctePrecedencePolicy` is introduced to 
control the behavior for name conflicting in the nested WITH clause. By default 
value `EXCEPTION`, Spark throws an AnalysisException, it forces users to choose 
the specific substitution order they wanted. If set to `CORRECTED` (which is 
recommended), inner CTE definitions take precedence over outer definitions. For 
example, set the config to `false`, `WITH t AS (SELECT 1), t2 AS (WITH t AS 
(SELECT 2) SELECT * FROM t) SELECT * FROM t2` returns `2`, while setting it to 
`LEGACY`, the result is `1` which is the behavior in version 2.4 and below.
 
-  - Since Spark 3.0, `spark.sql.legacy.ctePrecedencePolicy` is introduced to 
control the behavior for name conflicting in the nested WITH clause. By default 
value `EXCEPTION`, Spark throws an AnalysisException, it forces users to choose 
the specific substitution order they wanted. If set to `CORRECTED` (which is 
recommended), inner CTE definitions take precedence over outer definitions. For 
example, set the config to `false`, `WITH t AS (SELECT 1), t2 AS (WITH t AS 
(SELECT 2) SELECT * FROM t) SELECT * FROM t2` returns `2`, while setting it to 
`LEGACY`, the result is `1` which is the behavior in version 2.4 and earlier.
+  - In Spark 3.0, configuration `spark.sql.crossJoin.enabled` become internal 
configuration, and is true by default, so by default spark won't raise 
exception on sql with implicit cross join.
 
-  - Since Spark 3.0, configuration `spark.sql.crossJoin.enabled` become 
internal configuration, and is true by default, so by default spark won't raise 
exception on sql with implicit cross join.
+  - In Spark version 2.4 and below, float/double -0.0 is semantically equal to 
0.0, but -0.0 and 0.0 are considered as different values when used in aggregate 
grouping keys, window partition keys, and join keys. In Spark 3.0, this bug is 
fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` returns 
`[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and below.
 
-  - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal 
to 0.0, but -0.0 and 0.0 are considered as different values when used in 
aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, 
this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` 
returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and 
earlier.
+  - In Spark version 2.4 and below, invalid time zone ids are silently ignored 
and replaced by GMT time zone, for example, in the from_utc_timestamp function. 
In Spark 3.0, such time zone ids are rejected, and Spark throws 
`java.time.DateTimeException`.
 
-  - In Spark version 2.4 and earlier, invalid time zone ids are silently 
ignored and replaced by GMT time zone, for example, in the from_utc_timestamp 
function. Since Spark 3.0, such time zone ids are rejected, and Spark throws 
`java.time.DateTimeException`.
+  - In Spark 3.0, Proleptic Gregorian calendar is used in parsing, formatting, 
and converting dates and timestamps as well as in extracting sub-components 
like years, days and so on. Spark 3.0 uses Java 8 API classes from the 
`java.time` packages that are based on [ISO 
chronology](https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html).
 In Spark version 2.4 and below, those operations are performed using the 
hybrid calendar ([Julian + 
Gregorian](https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html).
 The changes impact on the results for dates before October 15, 1582 
(Gregorian) and affect on the following Spark 3.0 API:
 
-  - Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, 
formatting, and converting dates and timestamps as well as in extracting 
sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from 
the java.time packages that based on ISO chronology 
(https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html).
 In Spark version 2.4 and earlier, those operations are performed by using the 
hybrid calendar (Julian + Gregorian, see 
https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html). 
The changes impact on the results for dates before October 15, 1582 (Gregorian) 
and affect on the following Spark 3.0 API:
+    * Parsing/formatting of timestamp/date strings. This effects on CSV/JSON 
datasources and on the `unix_timestamp`, `date_format`, `to_unix_timestamp`, 
`from_unixtime`, `to_date`, `to_timestamp` functions when patterns specified by 
users is used for parsing and formatting. In Spark 3.0, we define our own 
pattern strings in `sql-ref-datetime-pattern.md`, which is implemented via 
`java.time.format.DateTimeFormatter` under the hood. New implementation 
performs strict checking of its input. For example, the `2015-07-22 10:00:00` 
timestamp cannot be parse if pattern is `yyyy-MM-dd` because the parser does 
not consume whole input. Another example is the `31/01/2015 00:00` input cannot 
be parsed by the `dd/MM/yyyy hh:mm` pattern because `hh` supposes hours in the 
range `1-12`. In Spark version 2.4 and below, `java.text.SimpleDateFormat` is 
used for timestamp/date string conversions, and the supported patterns are 
described in 
[simpleDateFormat](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html).
 The old behavior can be restored by setting 
`spark.sql.legacy.timeParserPolicy` to `LEGACY`.
 
 Review comment:
   This doc majorly focuses on the behavior changes. Performance-related 
numbers are sensitive to the workloads and environments. Not sure how 
significant it is after these PRs. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] gatorsmile commented on a change in pull request #28125: [SPARK-31351][DOC] Migration Guide Auditing for Spark 3.0 Release

Reply via email to