[spark] branch branch-3.0 updated: [SPARK-31151][SQL][DOC] Reorganize the migration guide of SQL

yamamuro Sat, 14 Mar 2020 15:38:51 -0700

This is an automated email from the ASF dual-hosted git repository.

yamamuro pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new f83ef7d  [SPARK-31151][SQL][DOC] Reorganize the migration guide of SQL
f83ef7d is described below

commit f83ef7d143aafbbdd1bb322567481f68db72195a
Author: gatorsmile <gatorsm...@gmail.com>
AuthorDate: Sun Mar 15 07:35:20 2020 +0900

    [SPARK-31151][SQL][DOC] Reorganize the migration guide of SQL
    
    ### What changes were proposed in this pull request?
    The current migration guide of SQL is too long for most readers to find the needed info. This PR is to group the items in the migration guide of Spark SQL based on the corresponding components.
    
    Note: This PR does not change the contents of the migration guides. The attached figure is a screenshot taken after the change.
    
    
![screencapture-127-0-0-1-4000-sql-migration-guide-html-2020-03-14-12_00_40](https://user-images.githubusercontent.com/11567269/76688626-d3010200-65eb-11ea-9ce7-265bc90ebb2c.png)
    
    ### Why are the changes needed?
    The current migration guide of SQL is too long for most readers to find the needed info.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    N/A
    
    Closes #27909 from gatorsmile/migrationGuideReorg.
    
    Authored-by: gatorsmile <gatorsm...@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamam...@apache.org>
    (cherry picked from commit 4d4c00c1b564b57d3016ce8c3bfcffaa6e58f012)
    Signed-off-by: Takeshi Yamamuro <yamam...@apache.org>
---
 docs/sql-migration-guide.md | 287 +++++++++++++++++++++++---------------------
 1 file changed, 150 insertions(+), 137 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 19c744c..31d5c68 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -23,92 +23,119 @@ license: |
 {:toc}
 
 ## Upgrading from Spark SQL 2.4 to 3.0
-  - Since Spark 3.0, when inserting a value into a table column with a 
different data type, the type coercion is performed as per ANSI SQL standard. 
Certain unreasonable type conversions such as converting `string` to `int` and 
`double` to `boolean` are disallowed. A runtime exception will be thrown if the 
value is out-of-range for the data type of the column. In Spark version 2.4 and 
earlier, type conversions during table insertion are allowed as long as they 
are valid `Cast`. When inse [...]
 
-  - In Spark 3.0, the deprecated methods `SQLContext.createExternalTable` and 
`SparkSession.createExternalTable` have been removed in favor of its 
replacement, `createTable`.
-
-  - In Spark 3.0, the deprecated `HiveContext` class has been removed. Use 
`SparkSession.builder.enableHiveSupport()` instead.
-
-  - Since Spark 3.0, configuration `spark.sql.crossJoin.enabled` become 
internal configuration, and is true by default, so by default spark won't raise 
exception on sql with implicit cross join.
-
-  - In Spark version 2.4 and earlier, SQL queries such as `FROM <table>` or 
`FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style 
`FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither 
Hive nor Presto support this syntax. Therefore we will treat these queries as 
invalid since Spark 3.0.
+### Dataset/DataFrame APIs
 
   - Since Spark 3.0, the Dataset and DataFrame API `unionAll` is not 
deprecated any more. It is an alias for `union`.
 
-  - In Spark version 2.4 and earlier, the parser of JSON data source treats 
empty strings as null for some data types such as `IntegerType`. For 
`FloatType`, `DoubleType`, `DateType` and `TimestampType`, it fails on empty 
strings and throws exceptions. Since Spark 3.0, we disallow empty strings and 
will throw exceptions for data types except for `StringType` and `BinaryType`. 
The previous behaviour of allowing empty string can be restored by setting 
`spark.sql.legacy.json.allowEmptyStrin [...]
-
-  - Since Spark 3.0, the `from_json` functions supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` 
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
-
-  - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
-
-  - In Spark version 2.4 and earlier, users can create map values with map 
type key via built-in function such as `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can use `map_entries` function to convert map to 
array<struct<key, value>> as a workaround. In addition, users can still read 
map values with map type key from data source or Java/Scala collections, though 
it is discouraged.
-
   - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a 
grouped dataset with key attribute wrongly named as "value", if the key is 
non-struct type, e.g. int, string, array, etc. This is counterintuitive and 
makes the schema of aggregation queries weird. For example, the schema of 
`ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the 
grouping attribute to "key". The old behaviour is preserved under a newly added 
configuration `spark.sql.legacy.data [...]
 
-  - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal 
to 0.0, but -0.0 and 0.0 are considered as different values when used in 
aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, 
this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` 
returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and 
earlier.
-
-  - In Spark version 2.4 and earlier, users can create a map with duplicated 
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior 
of map with duplicated keys is undefined, e.g. map look up respects the 
duplicated key appears first, `Dataset.collect` only keeps the duplicated key 
appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, Spark 
will throw RuntimeException while duplicated keys are found. Users can set 
`spark.sql.mapKeyDedupPolicy` to L [...]
+### DDL Statements
 
-  - In Spark version 2.4 and earlier, partition column value is converted as 
null if it can't be casted to corresponding user provided schema. Since 3.0, 
partition column value is validated with user provided schema. An exception is 
thrown if the validation fails. You can disable such validation by setting 
`spark.sql.sources.validatePartitionColumns` to `false`.
+  - Since Spark 3.0, `CREATE TABLE` without a specific provider will use the 
value of `spark.sql.sources.default` as its provider. In Spark version 2.4 and 
earlier, it was hive. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
 
-  - In Spark version 2.4 and earlier, the `SET` command works without any 
warnings even if the specified key is for `SparkConf` entries and it has no 
effect because the command does not update `SparkConf`, but the behavior might 
confuse users. Since 3.0, the command fails if a `SparkConf` key is used. You 
can disable such a check by setting 
`spark.sql.legacy.setCommandRejectsSparkCoreConfs` to `false`.
+  - Since Spark 3.0, when inserting a value into a table column with a 
different data type, the type coercion is performed as per ANSI SQL standard. 
Certain unreasonable type conversions such as converting `string` to `int` and 
`double` to `boolean` are disallowed. A runtime exception will be thrown if the 
value is out-of-range for the data type of the column. In Spark version 2.4 and 
earlier, type conversions during table insertion are allowed as long as they 
are valid `Cast`. When inse [...]
 
-  - In Spark version 2.4 and earlier, CSV datasource converts a malformed CSV 
string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, the 
returned row can contain non-`null` fields if some of CSV column values were 
parsed and converted to desired types successfully.
+  - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
-  - In Spark version 2.4 and earlier, JSON datasource and JSON functions like 
`from_json` convert a bad JSON record to a row with all `null`s in the 
PERMISSIVE mode when specified schema is `StructType`. Since Spark 3.0, the 
returned row can contain non-`null` fields if some of JSON column values were 
parsed and converted to desired types successfully.
+  - In Spark version 2.4 and earlier, the `SET` command works without any 
warnings even if the specified key is for `SparkConf` entries and it has no 
effect because the command does not update `SparkConf`, but the behavior might 
confuse users. Since 3.0, the command fails if a `SparkConf` key is used. You 
can disable such a check by setting 
`spark.sql.legacy.setCommandRejectsSparkCoreConfs` to `false`.
 
   - Refreshing a cached table would trigger a table uncache operation and then 
a table cache (lazily) operation. In Spark version 2.4 and earlier, the cache 
name and storage level are not preserved before the uncache operation. 
Therefore, the cache name and storage level could be changed unexpectedly. 
Since Spark 3.0, cache name and storage level will be first preserved for cache 
recreation. It helps to maintain a consistent cache behavior upon table 
refreshing.
 
-  - Since Spark 3.0, JSON datasource and JSON function `schema_of_json` infer 
TimestampType from string values if they match to the pattern defined by the 
JSON option `timestampFormat`. Set JSON option `inferTimestamp` to `false` to 
disable such type inferring.
-
-  - Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, 
DataType)` is not allowed by default. Set 
`spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. But please 
note that, in Spark version 2.4 and earlier, if 
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure 
with primitive-type argument, the returned UDF will return null if the input 
values is null. However, since Spark 3.0, the UDF will return the default value 
of the Java type if t [...]
-
-  - Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, 
formatting, and converting dates and timestamps as well as in extracting 
sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from 
the java.time packages that based on ISO chronology 
(https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html).
 In Spark version 2.4 and earlier, those operations are performed by using the 
hybrid calendar (Julian + Gregorian, see https://docs.orac [...]
-
-    - Parsing/formatting of timestamp/date strings. This effects on CSV/JSON 
datasources and on the `unix_timestamp`, `date_format`, `to_unix_timestamp`, 
`from_unixtime`, `to_date`, `to_timestamp` functions when patterns specified by 
users is used for parsing and formatting. Since Spark 3.0, we define our own 
pattern strings in `sql-ref-datetime-pattern.md`, which is implemented via 
`java.time.format.DateTimeFormatter` under the hood. New implementation 
performs strict checking of its in [...]
-
-    - The `weekofyear`, `weekday`, `dayofweek`, `date_trunc`, 
`from_utc_timestamp`, `to_utc_timestamp`, and `unix_timestamp` functions use 
java.time API for calculation week number of year, day number of week as well 
for conversion from/to TimestampType values in UTC time zone.
-
-    - the JDBC options `lowerBound` and `upperBound` are converted to 
TimestampType/DateType values in the same way as casting strings to 
TimestampType/DateType values. The conversion is based on Proleptic Gregorian 
calendar, and time zone defined by the SQL config `spark.sql.session.timeZone`. 
In Spark version 2.4 and earlier, the conversion is based on the hybrid 
calendar (Julian + Gregorian) and on default system time zone.
+  - Since Spark 3.0, the properties listing below become reserved, commands 
will fail if we specify reserved properties in places like `CREATE DATABASE ... 
WITH DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. We need their 
specific clauses to specify them, e.g. `CREATE DATABASE test COMMENT 'any 
comment' LOCATION 'some path'`. We can set 
`spark.sql.legacy.notReserveProperties` to `true` to ignore the 
`ParseException`, in this case, these properties will be silently removed, e.g 
`S [...]
+    <table class="table">
+        <tr>
+          <th>
+            <b>Property(case sensitive)</b>
+          </th>
+          <th>
+            <b>Database Reserved</b>
+          </th>
+          <th>
+            <b>Table Reserved</b>
+          </th>
+          <th>
+            <b>Remarks</b>
+          </th>
+        </tr>
+        <tr>
+          <td>
+            provider
+          </td>
+          <td>
+            no
+          </td>
+          <td>
+            yes
+          </td>
+          <td>
+            For tables, please use the USING clause to specify it. Once set, 
it can't be changed.
+          </td>
+        </tr>
+        <tr>
+          <td>
+            location
+          </td>
+          <td>
+            yes
+          </td>
+          <td>
+            yes
+          </td>
+          <td>
+            For databases and tables, please use the LOCATION clause to 
specify it.
+          </td>
+        </tr>
+        <tr>
+          <td>
+            owner
+          </td>
+          <td>
+            yes
+          </td>
+          <td>
+            yes
+          </td>
+          <td>
+            For databases and tables, it is determined by the user who runs 
spark and create the table.
+          </td>
+        </tr>
+    </table>
 
-    - Formatting of `TIMESTAMP` and `DATE` literals.
+  - Since Spark 3.0, `ADD FILE` can be used to add file directories as well. 
Earlier only single files can be added using this command. To restore the 
behaviour of earlier versions, set `spark.sql.legacy.addSingleFileInAddFile` to 
`true`.
 
-    - Creating of typed `TIMESTAMP` and `DATE` literals from strings. Since 
Spark 3.0, string conversion to typed `TIMESTAMP`/`DATE` literals is performed 
via casting to `TIMESTAMP`/`DATE` values. For example, `TIMESTAMP '2019-12-23 
12:59:30'` is semantically equal to `CAST('2019-12-23 12:59:30' AS TIMESTAMP)`. 
When the input string does not contain information about time zone, the time 
zone from the SQL config `spark.sql.session.timeZone` is used in that case. In 
Spark version 2.4 and e [...]
+  - Since Spark 3.0, `SHOW TBLPROPERTIES` will cause `AnalysisException` if 
the table does not exist. In Spark version 2.4 and earlier, this scenario 
caused `NoSuchTableException`. Also, `SHOW TBLPROPERTIES` on a temporary view 
will cause `AnalysisException`. In Spark version 2.4 and earlier, it returned 
an empty result.
 
-  - In Spark version 2.4 and earlier, invalid time zone ids are silently 
ignored and replaced by GMT time zone, for example, in the from_utc_timestamp 
function. Since Spark 3.0, such time zone ids are rejected, and Spark throws 
`java.time.DateTimeException`.
+  - Since Spark 3.0, `SHOW CREATE TABLE` will always return Spark DDL, even 
when the given table is a Hive serde table. For generating Hive DDL, please use 
`SHOW CREATE TABLE AS SERDE` command instead.
 
-  - In Spark version 2.4 and earlier, the `current_timestamp` function returns 
a timestamp with millisecond resolution only. Since Spark 3.0, the function can 
return the result with microsecond resolution if the underlying clock available 
on the system offers such resolution.
+### UDFs and Built-in Functions
 
-  - In Spark version 2.4 and earlier, when reading a Hive Serde table with 
Spark native data sources(parquet/orc), Spark will infer the actual file schema 
and update the table schema in metastore. Since Spark 3.0, Spark doesn't infer 
the schema anymore. This should not cause any problems to end users, but if it 
does, please set `spark.sql.hive.caseSensitiveInferenceMode` to 
`INFER_AND_SAVE`.
+  - Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, 
smallint, tinyint as the 2nd argument, fractional and string types are not 
valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause 
`AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is 
fractional or string value, it will be coerced to int value, and the result 
will be a date value of `1964-06-04`.
 
-  - Since Spark 3.0, `TIMESTAMP` literals are converted to strings using the 
SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the 
conversion uses the default time zone of the Java virtual machine.
+  - Since Spark 3.0, the function `percentile_approx` and its alias 
`approx_percentile` only accept integral value with range in `[1, 2147483647]` 
as its 3rd argument `accuracy`, fractional and string types are disallowed, 
e.g. `percentile_approx(10.0, 0.2, 1.8D)` will cause `AnalysisException`. In 
Spark version 2.4 and earlier, if `accuracy` is fractional or string value, it 
will be coerced to an int value, `percentile_approx(10.0, 0.2, 1.8D)` is 
operated as `percentile_approx(10.0, 0.2 [...]
 
-  - In Spark version 2.4, when a spark session is created via 
`cloneSession()`, the newly created spark session inherits its configuration 
from its parent `SparkContext` even though the same configuration may exist 
with a different value in its parent spark session. Since Spark 3.0, the 
configurations of a parent `SparkSession` have a higher precedence over the 
parent `SparkContext`. The old behavior can be restored by setting 
`spark.sql.legacy.sessionInitWithConfigDefaults` to `true`.
+  - Since Spark 3.0, an analysis exception will be thrown when hash 
expressions are applied on elements of MapType. To restore the behavior before 
Spark 3.0, set `spark.sql.legacy.allowHashOnMapType` to `true`.
 
-  - Since Spark 3.0, parquet logical type `TIMESTAMP_MICROS` is used by 
default while saving `TIMESTAMP` columns. In Spark version 2.4 and earlier, 
`TIMESTAMP` columns are saved as `INT96` in parquet files. Note that, some SQL 
systems such as Hive 1.x and Impala 2.x can only read `INT96` timestamps, you 
can set `spark.sql.parquet.outputTimestampType` as `INT96` to restore the 
previous behavior and keep interoperability.
+  - Since Spark 3.0, when the `array`/`map` function is called without any 
parameters, it returns an empty collection with `NullType` as element type. In 
Spark version 2.4 and earlier, it returns an empty collection with `StringType` 
as element type. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createEmptyCollectionUsingStringType` to `true`.
 
-  - Since Spark 3.0, if `hive.default.fileformat` is not found in `Spark SQL 
configuration` then it will fallback to hive-site.xml present in the `Hadoop 
configuration` of `SparkContext`.
+  - Since Spark 3.0, the `from_json` functions supports two modes - 
`PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The 
default mode became `PERMISSIVE`. In previous versions, behavior of `from_json` 
did not conform to either `PERMISSIVE` nor `FAILFAST`, especially in processing 
of malformed JSON records. For example, the JSON string `{"a" 1}` with the 
schema `a INT` is converted to `null` by previous versions but Spark 3.0 
converts it to `Row(null)`.
 
-  - Since Spark 3.0, Spark will cast `String` to `Date/TimeStamp` in binary 
comparisons with dates/timestamps. The previous behaviour of casting 
`Date/Timestamp` to `String` can be restored by setting 
`spark.sql.legacy.typeCoercion.datetimeToString.enabled` to `true`.
+  - In Spark version 2.4 and earlier, users can create map values with map 
type key via built-in function such as `CreateMap`, `MapFromArrays`, etc. Since 
Spark 3.0, it's not allowed to create map values with map type key with these 
built-in functions. Users can use `map_entries` function to convert map to 
array<struct<key, value>> as a workaround. In addition, users can still read 
map values with map type key from data source or Java/Scala collections, though 
it is discouraged.
 
-  - Since Spark 3.0, when Avro files are written with user provided schema, 
the fields will be matched by field names between catalyst schema and avro 
schema instead of positions.
+  - In Spark version 2.4 and earlier, users can create a map with duplicated 
keys via built-in functions like `CreateMap`, `StringToMap`, etc. The behavior 
of map with duplicated keys is undefined, e.g. map look up respects the 
duplicated key appears first, `Dataset.collect` only keeps the duplicated key 
appears last, `MapKeys` returns duplicated keys, etc. Since Spark 3.0, Spark 
will throw RuntimeException while duplicated keys are found. Users can set 
`spark.sql.mapKeyDedupPolicy` to L [...]
 
-  - Since Spark 3.0, when Avro files are written with user provided 
non-nullable schema, even the catalyst schema is nullable, Spark is still able 
to write the files. However, Spark will throw runtime NPE if any of the records 
contains null.
+  - Since Spark 3.0, using `org.apache.spark.sql.functions.udf(AnyRef, 
DataType)` is not allowed by default. Set 
`spark.sql.legacy.allowUntypedScalaUDF` to true to keep using it. But please 
note that, in Spark version 2.4 and earlier, if 
`org.apache.spark.sql.functions.udf(AnyRef, DataType)` gets a Scala closure 
with primitive-type argument, the returned UDF will return null if the input 
values is null. However, since Spark 3.0, the UDF will return the default value 
of the Java type if t [...]
 
   - Since Spark 3.0, a higher-order function `exists` follows the three-valued 
boolean logic, i.e., if the `predicate` returns any `null`s and no `true` is 
obtained, then `exists` will return `null` instead of `false`. For example, 
`exists(array(1, null, 3), x -> x % 2 == 0)` will be `null`. The previous 
behaviour can be restored by setting 
`spark.sql.legacy.followThreeValuedLogicInArrayExists` to `false`.
 
-  - Since Spark 3.0, if files or subdirectories disappear during recursive 
directory listing (i.e. they appear in an intermediate listing but then cannot 
be read or listed during later phases of the recursive directory listing, due 
to either concurrent file deletions or object store consistency issues) then 
the listing will fail with an exception unless 
`spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous 
versions, these missing files or subdirectories would be i [...]
-
-  - Since Spark 3.0, `spark.sql.legacy.ctePrecedencePolicy` is introduced to 
control the behavior for name conflicting in the nested WITH clause. By default 
value `EXCEPTION`, Spark throws an AnalysisException, it forces users to choose 
the specific substitution order they wanted. If set to `CORRECTED` (which is 
recommended), inner CTE definitions take precedence over outer definitions. For 
example, set the config to `false`, `WITH t AS (SELECT 1), t2 AS (WITH t AS 
(SELECT 2) SELECT * FR [...]
-
   - Since Spark 3.0, the `add_months` function does not adjust the resulting 
date to a last day of month if the original date is a last day of months. For 
example, `select add_months(DATE'2019-02-28', 1)` results `2019-03-28`. In 
Spark version 2.4 and earlier, the resulting date is adjusted when the original 
date is a last day of months. For example, adding a month to `2019-02-28` 
results in `2019-03-31`.
 
+  - In Spark version 2.4 and earlier, the `current_timestamp` function returns 
a timestamp with millisecond resolution only. Since Spark 3.0, the function can 
return the result with microsecond resolution if the underlying clock available 
on the system offers such resolution.
+
   - Since Spark 3.0, 0-argument Java UDF is executed in the executor side 
identically with other UDFs. In Spark version 2.4 and earlier, 0-argument Java 
UDF alone was executed in the driver side, and the result was propagated to 
executors, which might be more performant in some cases but caused 
inconsistency with a correctness issue in some cases.
 
   - The result of `java.lang.Math`'s `log`, `log1p`, `exp`, `expm1`, and `pow` 
may vary across platforms. In Spark 3.0, the result of the equivalent SQL 
functions (including related SQL functions like `LOG10`) return values 
consistent with `java.lang.StrictMath`. In virtually all cases this makes no 
difference in the return value, and the difference is very small, but may not 
exactly match `java.lang.Math` on x86 platforms in cases like, for example, 
`log(3.0)`, whose value varies betwee [...]
 
-  - Since Spark 3.0, Dataset query fails if it contains ambiguous column 
reference that is caused by self join. A typical example: `val df1 = ...; val 
df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an 
empty result which is quite confusing. This is because Spark cannot resolve 
Dataset column references that point to tables being self joined, and 
`df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior 
before Spark 3.0, you can set `spark.s [...]
-
   - Since Spark 3.0, `Cast` function processes string literals such as 
'Infinity', '+Infinity', '-Infinity', 'NaN', 'Inf', '+Inf', '-Inf' in case 
insensitive manner when casting the literals to `Double` or `Float` type to 
ensure greater compatibility with other database systems. This behaviour change 
is illustrated in the table below:
     <table class="table">
         <tr>
@@ -198,6 +225,50 @@ license: |
         </tr>
     </table>
 
+  - Since Spark 3.0, when casting interval values to string type, there is no 
"interval" prefix, e.g. `1 days 2 hours`. In Spark version 2.4 and earlier, the 
string contains the "interval" prefix like `interval 1 days 2 hours`.
+
+  - Since Spark 3.0, when casting string value to integral types(tinyint, 
smallint, int and bigint), datetime types(date, timestamp and interval) and 
boolean type, the leading and trailing whitespaces (<= ASCII 32) will be 
trimmed before converted to these type values, e.g. `cast(' 1\t' as int)` 
results `1`, `cast(' 1\t' as boolean)` results `true`, `cast('2019-10-10\t as 
date)` results the date value `2019-10-10`. In Spark version 2.4 and earlier, 
while casting string to integrals and b [...]
+
+### Query Engine
+
+  - In Spark version 2.4 and earlier, SQL queries such as `FROM <table>` or 
`FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style 
`FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither 
Hive nor Presto support this syntax. Therefore we will treat these queries as 
invalid since Spark 3.0.
+
+  - Since Spark 3.0, the interval literal syntax does not allow multiple 
from-to units anymore. For example, `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' 
YEAR TO MONTH'` throws parser exception.
+
+  - Since Spark 3.0, numbers written in scientific notation(e.g. `1E2`) would 
be parsed as Double. In Spark version 2.4 and earlier, they're parsed as 
Decimal. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.exponentLiteralAsDecimal.enabled` to `true`.
+
+  - Since Spark 3.0, day-time interval strings are converted to intervals with 
respect to the `from` and `to` bounds. If an input string does not match to the 
pattern defined by specified bounds, the `ParseException` exception is thrown. 
For example, `interval '2 10:20' hour to minute` raises the exception because 
the expected format is `[+|-]h[h]:[m]m`. In Spark version 2.4, the `from` bound 
was not taken into account, and the `to` bound was used to truncate the 
resulted interval. For i [...]
+  
+  - Since Spark 3.0, negative scale of decimal is not allowed by default, e.g. 
data type of literal like `1E10BD` is `DecimalType(11, 0)`. In Spark version 
2.4 and earlier, it was `DecimalType(2, -9)`. To restore the behavior before 
Spark 3.0, you can set `spark.sql.legacy.allowNegativeScaleOfDecimal` to `true`.
+
+  - Since Spark 3.0, the unary arithmetic operator plus(`+`) only accepts 
string, numeric and interval type values as inputs. Besides, `+` with a 
integral string representation will be coerced to double value, e.g. `+'1'` 
results `1.0`. In Spark version 2.4 and earlier, this operator is ignored. 
There is no type checking for it, thus, all type values with a `+` prefix are 
valid, e.g. `+ array(1, 2)` is valid and results `[1, 2]`. Besides, there is no 
type coercion for it at all, e.g. in  [...]
+
+  - Since Spark 3.0, Dataset query fails if it contains ambiguous column 
reference that is caused by self join. A typical example: `val df1 = ...; val 
df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an 
empty result which is quite confusing. This is because Spark cannot resolve 
Dataset column references that point to tables being self joined, and 
`df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior 
before Spark 3.0, you can set `spark.s [...]
+
+  - Since Spark 3.0, `spark.sql.legacy.ctePrecedencePolicy` is introduced to 
control the behavior for name conflicting in the nested WITH clause. By default 
value `EXCEPTION`, Spark throws an AnalysisException, it forces users to choose 
the specific substitution order they wanted. If set to `CORRECTED` (which is 
recommended), inner CTE definitions take precedence over outer definitions. For 
example, set the config to `false`, `WITH t AS (SELECT 1), t2 AS (WITH t AS 
(SELECT 2) SELECT * FR [...]
+
+  - Since Spark 3.0, configuration `spark.sql.crossJoin.enabled` become 
internal configuration, and is true by default, so by default spark won't raise 
exception on sql with implicit cross join.
+
+  - In Spark version 2.4 and earlier, float/double -0.0 is semantically equal 
to 0.0, but -0.0 and 0.0 are considered as different values when used in 
aggregate grouping keys, window partition keys and join keys. Since Spark 3.0, 
this bug is fixed. For example, `Seq(-0.0, 0.0).toDF("d").groupBy("d").count()` 
returns `[(0.0, 2)]` in Spark 3.0, and `[(0.0, 1), (-0.0, 1)]` in Spark 2.4 and 
earlier.
+
+  - In Spark version 2.4 and earlier, invalid time zone ids are silently 
ignored and replaced by GMT time zone, for example, in the from_utc_timestamp 
function. Since Spark 3.0, such time zone ids are rejected, and Spark throws 
`java.time.DateTimeException`.
+
+  - Since Spark 3.0, Proleptic Gregorian calendar is used in parsing, 
formatting, and converting dates and timestamps as well as in extracting 
sub-components like years, days and etc. Spark 3.0 uses Java 8 API classes from 
the java.time packages that based on ISO chronology 
(https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html).
 In Spark version 2.4 and earlier, those operations are performed by using the 
hybrid calendar (Julian + Gregorian, see https://docs.orac [...]
+
+    - Parsing/formatting of timestamp/date strings. This effects on CSV/JSON 
datasources and on the `unix_timestamp`, `date_format`, `to_unix_timestamp`, 
`from_unixtime`, `to_date`, `to_timestamp` functions when patterns specified by 
users is used for parsing and formatting. Since Spark 3.0, we define our own 
pattern strings in `sql-ref-datetime-pattern.md`, which is implemented via 
`java.time.format.DateTimeFormatter` under the hood. New implementation 
performs strict checking of its in [...]
+
+    - The `weekofyear`, `weekday`, `dayofweek`, `date_trunc`, 
`from_utc_timestamp`, `to_utc_timestamp`, and `unix_timestamp` functions use 
the java.time API to calculate the week number of the year and the day number 
of the week, as well as to convert from/to TimestampType values in the UTC 
time zone.
+
+    - The JDBC options `lowerBound` and `upperBound` are converted to 
TimestampType/DateType values in the same way as casting strings to 
TimestampType/DateType values. The conversion is based on the Proleptic 
Gregorian calendar and the time zone defined by the SQL config 
`spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the conversion 
is based on the hybrid calendar (Julian + Gregorian) and on the default system 
time zone.
+
+    - Formatting of `TIMESTAMP` and `DATE` literals.
+
+    - Creation of typed `TIMESTAMP` and `DATE` literals from strings. Since 
Spark 3.0, string conversion to typed `TIMESTAMP`/`DATE` literals is performed 
via casting to `TIMESTAMP`/`DATE` values. For example, `TIMESTAMP '2019-12-23 
12:59:30'` is semantically equal to `CAST('2019-12-23 12:59:30' AS TIMESTAMP)`. 
When the input string does not contain information about the time zone, the 
time zone from the SQL config `spark.sql.session.timeZone` is used. In 
Spark version 2.4 and e [...]
+
+  - Since Spark 3.0, `TIMESTAMP` literals are converted to strings using the 
SQL config `spark.sql.session.timeZone`. In Spark version 2.4 and earlier, the 
conversion uses the default time zone of the Java virtual machine.
+
+  - Since Spark 3.0, Spark will cast `String` to `Date/Timestamp` in binary 
comparisons with dates/timestamps. The previous behaviour of casting 
`Date/Timestamp` to `String` can be restored by setting 
`spark.sql.legacy.typeCoercion.datetimeToString.enabled` to `true`.
+
   - Since Spark 3.0, special values are supported in conversion from strings 
to dates and timestamps. Those values are simply notational shorthands that 
will be converted to ordinary date or timestamp values when read. The following 
string values are supported for dates:
     - `epoch [zoneId]` - 1970-01-01
     - `today [zoneId]` - the current date in the time zone specified by 
`spark.sql.session.timeZone`
@@ -212,17 +283,37 @@ license: |
     - `now` - current query start time
   For example `SELECT timestamp 'tomorrow';`.
 
-  - Since Spark 3.0, when the `array`/`map` function is called without any 
parameters, it returns an empty collection with `NullType` as element type. In 
Spark version 2.4 and earlier, it returns an empty collection with `StringType` 
as element type. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createEmptyCollectionUsingStringType` to `true`.
+### Data Sources
 
-  - Since Spark 3.0, the interval literal syntax does not allow multiple 
from-to units anymore. For example, `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' 
YEAR TO MONTH` throws a parser exception.
+  - In Spark version 2.4 and earlier, when reading a Hive Serde table with 
Spark native data sources (parquet/orc), Spark infers the actual file schema 
and updates the table schema in the metastore. Since Spark 3.0, Spark doesn't 
infer the schema anymore. This should not cause any problems for end users, but 
if it does, please set `spark.sql.hive.caseSensitiveInferenceMode` to 
`INFER_AND_SAVE`.
 
-  - Since Spark 3.0, when casting interval values to string type, there is no 
"interval" prefix, e.g. `1 days 2 hours`. In Spark version 2.4 and earlier, the 
string contains the "interval" prefix like `interval 1 days 2 hours`.
+  - In Spark version 2.4 and earlier, a partition column value is converted to 
null if it can't be cast to the corresponding user-provided schema. Since Spark 
3.0, the partition column value is validated against the user-provided schema, 
and an exception is thrown if the validation fails. You can disable such 
validation by setting `spark.sql.sources.validatePartitionColumns` to `false`.
 
-  - Since Spark 3.0, when casting string values to integral types (tinyint, 
smallint, int and bigint), datetime types (date, timestamp and interval) and 
boolean type, the leading and trailing whitespaces (<= ASCII 32) will be 
trimmed before conversion to these type values, e.g. `cast(' 1\t' as int)` 
results in `1`, `cast(' 1\t' as boolean)` results in `true`, 
`cast('2019-10-10\t' as date)` results in the date value `2019-10-10`. In Spark 
version 2.4 and earlier, 
while casting string to integrals and b [...]
+  - Since Spark 3.0, if files or subdirectories disappear during recursive 
directory listing (i.e. they appear in an intermediate listing but then cannot 
be read or listed during later phases of the recursive directory listing, due 
to either concurrent file deletions or object store consistency issues) then 
the listing will fail with an exception unless 
`spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous 
versions, these missing files or subdirectories would be i [...]
 
-  - Since Spark 3.0, an analysis exception will be thrown when hash 
expressions are applied on elements of MapType. To restore the behavior before 
Spark 3.0, set `spark.sql.legacy.allowHashOnMapType` to `true`.
-    
-  - Since Spark 3.0, numbers written in scientific notation(e.g. `1E2`) would 
be parsed as Double. In Spark version 2.4 and earlier, they're parsed as 
Decimal. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.exponentLiteralAsDecimal.enabled` to `true`.
+  - In Spark version 2.4 and earlier, the parser of JSON data source treats 
empty strings as null for some data types such as `IntegerType`. For 
`FloatType`, `DoubleType`, `DateType` and `TimestampType`, it fails on empty 
strings and throws exceptions. Since Spark 3.0, we disallow empty strings and 
will throw exceptions for data types except for `StringType` and `BinaryType`. 
The previous behaviour of allowing empty string can be restored by setting 
`spark.sql.legacy.json.allowEmptyStrin [...]
+
+  - In Spark version 2.4 and earlier, the JSON datasource and JSON functions 
like `from_json` convert a bad JSON record to a row with all `null`s in the 
PERMISSIVE mode when the specified schema is `StructType`. Since Spark 3.0, the 
returned row can contain non-`null` fields if some of the JSON column values 
were parsed and converted to the desired types successfully.
+
+  - Since Spark 3.0, the JSON datasource and the JSON function 
`schema_of_json` infer TimestampType from string values if they match the 
pattern defined by the JSON option `timestampFormat`. Set the JSON option 
`inferTimestamp` to `false` to disable this type inference.
+
+  - In Spark version 2.4 and earlier, the CSV datasource converts a malformed 
CSV string to a row with all `null`s in the PERMISSIVE mode. Since Spark 3.0, 
the returned row can contain non-`null` fields if some of the CSV column values 
were parsed and converted to the desired types successfully.
+
+  - Since Spark 3.0, the parquet logical type `TIMESTAMP_MICROS` is used by 
default when saving `TIMESTAMP` columns. In Spark version 2.4 and earlier, 
`TIMESTAMP` columns are saved as `INT96` in parquet files. Note that some SQL 
systems such as Hive 1.x and Impala 2.x can only read `INT96` timestamps; you 
can set `spark.sql.parquet.outputTimestampType` to `INT96` to restore the 
previous behavior and keep interoperability.
+
+  - Since Spark 3.0, when Avro files are written with a user-provided schema, 
the fields are matched by field name between the catalyst schema and the avro 
schema instead of by position.
+
+  - Since Spark 3.0, when Avro files are written with a user-provided 
non-nullable schema, Spark is still able to write the files even if the 
catalyst schema is nullable. However, Spark will throw a runtime NPE if any of 
the records contains null.
+
+### Others
+
+  - In Spark 3.0, the deprecated methods `SQLContext.createExternalTable` and 
`SparkSession.createExternalTable` have been removed in favor of their 
replacement, `createTable`.
+
+  - In Spark 3.0, the deprecated `HiveContext` class has been removed. Use 
`SparkSession.builder.enableHiveSupport()` instead.
+
+  - In Spark version 2.4, when a spark session is created via 
`cloneSession()`, the newly created spark session inherits its configuration 
from its parent `SparkContext` even though the same configuration may exist 
with a different value in its parent spark session. Since Spark 3.0, the 
configurations of a parent `SparkSession` have a higher precedence over the 
parent `SparkContext`. The old behavior can be restored by setting 
`spark.sql.legacy.sessionInitWithConfigDefaults` to `true`.
+
+  - Since Spark 3.0, if `hive.default.fileformat` is not found in the `Spark 
SQL configuration`, Spark falls back to the hive-site.xml present in the 
`Hadoop configuration` of `SparkContext`.
 
   - Since Spark 3.0, we pad decimal numbers with trailing zeros to the scale 
of the column for `spark-sql` interface, for example:
     <table class="table">
@@ -249,84 +340,6 @@ license: |
           </td>
         </tr>
     </table>
-    
-  - Since Spark 3.0, `CREATE TABLE` without a specific provider will use the 
value of `spark.sql.sources.default` as its provider. In Spark version 2.4 and 
earlier, it was hive. To restore the behavior before Spark 3.0, you can set 
`spark.sql.legacy.createHiveTableByDefault.enabled` to `true`.
-
-  - Since Spark 3.0, the unary arithmetic operator plus(`+`) only accepts 
string, numeric and interval type values as inputs. Besides, `+` with an 
integral string representation will be coerced to a double value, e.g. `+'1'` 
results in `1.0`. In Spark version 2.4 and earlier, this operator is ignored. 
There is no type checking for it, thus, all type values with a `+` prefix are 
valid, e.g. `+ array(1, 2)` is valid and results in `[1, 2]`. Besides, there is no 
type coercion for it at all, e.g. in  [...]
-
-  - Since Spark 3.0, day-time interval strings are converted to intervals with 
respect to the `from` and `to` bounds. If an input string does not match the 
pattern defined by the specified bounds, a `ParseException` is thrown. 
For example, `interval '2 10:20' hour to minute` raises the exception because 
the expected format is `[+|-]h[h]:[m]m`. In Spark version 2.4, the `from` bound 
was not taken into account, and the `to` bound was used to truncate the 
resulted interval. For i [...]
-  
-  - Since Spark 3.0, negative scale of decimal is not allowed by default, e.g. 
data type of literal like `1E10BD` is `DecimalType(11, 0)`. In Spark version 
2.4 and earlier, it was `DecimalType(2, -9)`. To restore the behavior before 
Spark 3.0, you can set `spark.sql.legacy.allowNegativeScaleOfDecimal` to `true`.
-
-  - Since Spark 3.0, the `date_add` and `date_sub` functions only accept int, 
smallint, tinyint as the 2nd argument; fractional and string types are not 
valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause 
`AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is a 
fractional or string value, it will be coerced to an int value, and the result 
will be a date value of `1964-06-04`.
-
-  - Since Spark 3.0, the function `percentile_approx` and its alias 
`approx_percentile` only accept integral value with range in `[1, 2147483647]` 
as its 3rd argument `accuracy`, fractional and string types are disallowed, 
e.g. `percentile_approx(10.0, 0.2, 1.8D)` will cause `AnalysisException`. In 
Spark version 2.4 and earlier, if `accuracy` is fractional or string value, it 
will be coerced to an int value, `percentile_approx(10.0, 0.2, 1.8D)` is 
operated as `percentile_approx(10.0, 0.2 [...]
-
-  - Since Spark 3.0, the properties listed below are reserved; commands 
will fail if we specify reserved properties in places like `CREATE DATABASE ... 
WITH DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. We need their 
specific clauses to specify them, e.g. `CREATE DATABASE test COMMENT 'any 
comment' LOCATION 'some path'`. We can set 
`spark.sql.legacy.notReserveProperties` to `true` to ignore the 
`ParseException`; in this case, these properties will be silently removed, e.g. 
`S [...]
-    <table class="table">
-        <tr>
-          <th>
-            <b>Property(case sensitive)</b>
-          </th>
-          <th>
-            <b>Database Reserved</b>
-          </th>
-          <th>
-            <b>Table Reserved</b>
-          </th>
-          <th>
-            <b>Remarks</b>
-          </th>
-        </tr>
-        <tr>
-          <td>
-            provider
-          </td>
-          <td>
-            no
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            For tables, please use the USING clause to specify it. Once set, 
it can't be changed.
-          </td>
-        </tr>
-        <tr>
-          <td>
-            location
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            For databases and tables, please use the LOCATION clause to 
specify it.
-          </td>
-        </tr>
-        <tr>
-          <td>
-            owner
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            yes
-          </td>
-          <td>
-            For databases and tables, it is determined by the user who runs 
Spark and creates the table.
-          </td>
-        </tr>
-    </table>
-
-  - Since Spark 3.0, `ADD FILE` can be used to add file directories as well. 
Earlier, only single files could be added using this command. To restore the 
behaviour of earlier versions, set `spark.sql.legacy.addSingleFileInAddFile` to 
`true`.
-
-  - Since Spark 3.0, `SHOW TBLPROPERTIES` will cause `AnalysisException` if 
the table does not exist. In Spark version 2.4 and earlier, this scenario 
caused `NoSuchTableException`. Also, `SHOW TBLPROPERTIES` on a temporary view 
will cause `AnalysisException`. In Spark version 2.4 and earlier, it returned 
an empty result.
-
-  - Since Spark 3.0, `SHOW CREATE TABLE` will always return Spark DDL, even 
when the given table is a Hive serde table. For generating Hive DDL, please use 
`SHOW CREATE TABLE AS SERDE` command instead.
 
   - Since Spark 3.0, we upgraded the built-in Hive from 1.2 to 2.3, which 
brings the following impacts:
   
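
The -0.0 item in the diff above says that Spark 2.4 treated -0.0 and 0.0 as different grouping keys even though they compare equal. The following is a minimal plain-Java sketch of how that kind of split can arise. It is not Spark's implementation; the class `NegativeZeroKeys` and both methods are hypothetical names chosen for illustration, and grouping by raw bit pattern is only an assumed stand-in for the old behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from the Spark code base.
final class NegativeZeroKeys {

    // Groups by the raw IEEE 754 bit pattern. -0.0 and 0.0 differ in the
    // sign bit, so they end up in separate groups.
    static Map<Long, Integer> countByBits(double[] values) {
        Map<Long, Integer> counts = new LinkedHashMap<>();
        for (double v : values) {
            counts.merge(Double.doubleToLongBits(v), 1, Integer::sum);
        }
        return counts;
    }

    // Normalizes -0.0 to 0.0 before grouping, so both values share one key.
    static Map<Double, Integer> countNormalized(double[] values) {
        Map<Double, Integer> counts = new LinkedHashMap<>();
        for (double v : values) {
            double key = (v == 0.0) ? 0.0 : v; // -0.0 == 0.0 evaluates to true
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] values = {-0.0, 0.0};
        System.out.println(countByBits(values).size()); // prints 2
        System.out.println(countNormalized(values));    // prints {0.0=2}
    }
}
```

The normalized variant mirrors the `[(0.0, 2)]` result quoted for Spark 3.0, while the bit-pattern variant yields two groups, as in the Spark 2.4 result.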

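
The time zone item in the diff above contrasts a silent fallback to GMT with a `java.time.DateTimeException`. The JDK itself shows the same contrast between its legacy and java.time lookups, which the sketch below demonstrates. Whether Spark's internals call exactly these methods is an assumption, and the class `ZoneIdLookup`, its methods, and the id `Mars/Phobos` are hypothetical.

```java
import java.time.DateTimeException;
import java.time.ZoneId;
import java.util.TimeZone;

// Hypothetical illustration only; not taken from the Spark code base.
final class ZoneIdLookup {

    // Legacy lookup: java.util.TimeZone returns the GMT zone when the id
    // cannot be understood, without reporting an error.
    static String legacyZone(String id) {
        return TimeZone.getTimeZone(id).getID();
    }

    // java.time lookup: ZoneId.of throws a DateTimeException (or a subclass)
    // for ids it cannot resolve.
    static boolean strictAccepts(String id) {
        try {
            ZoneId.of(id);
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String bogus = "Mars/Phobos"; // deliberately invalid id
        System.out.println(legacyZone(bogus));
        System.out.println(strictAccepts(bogus));
    }
}
```

Code that relied on the old fallback would need to validate zone ids up front, for example with a check like `strictAccepts`, before passing them to functions such as `from_utc_timestamp`.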

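
The calendar item in the diff above distinguishes the Proleptic Gregorian calendar (java.time) from the hybrid Julian + Gregorian calendar. As a rough sketch of the practical difference, the JDK's `java.util.GregorianCalendar` skips from 4 October 1582 directly to 15 October 1582, while `java.time.LocalDate` counts every day in between. The class `CalendarGap` and its method names are hypothetical, and the mapping onto Spark's internals is an assumption.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical illustration only; not taken from the Spark code base.
final class CalendarGap {

    // Days from 1582-10-04 to 1582-10-15 in the hybrid calendar, which
    // switches from Julian to Gregorian at the default cutover date.
    static long hybridDays() {
        GregorianCalendar start = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        start.clear();
        start.set(1582, Calendar.OCTOBER, 4);
        GregorianCalendar end = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        end.clear();
        end.set(1582, Calendar.OCTOBER, 15);
        return (end.getTimeInMillis() - start.getTimeInMillis()) / 86_400_000L;
    }

    // The same span in the proleptic Gregorian (ISO) calendar.
    static long prolepticDays() {
        return ChronoUnit.DAYS.between(
                LocalDate.of(1582, 10, 4), LocalDate.of(1582, 10, 15));
    }

    public static void main(String[] args) {
        System.out.println(hybridDays());    // prints 1
        System.out.println(prolepticDays()); // prints 11
    }
}
```

Dates before the cutover can therefore map to different day counts under the two calendars, which is why old timestamps may shift after an upgrade.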