[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

kiszk Thu, 18 Oct 2018 03:08:13 -0700

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226246375
  
    --- Diff: docs/sql-migration-guide-upgrade.md ---
    @@ -0,0 +1,520 @@
    +---
    +layout: global
    +title: Spark SQL Upgrading Guide
    +displayTitle: Spark SQL Upgrading Guide
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Upgrading From Spark SQL 2.4 to 3.0
    +
    +  - In PySpark, when creating a `SparkSession` with 
`SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, 
the builder was trying to update the `SparkConf` of the existing `SparkContext` 
with configurations specified to the builder, but the `SparkContext` is shared 
by all `SparkSession`s, so we should not update them. Since 3.0, the builder 
come to not update the configurations. This is the same behavior as Java/Scala 
API in 2.3 and above. If you want to update them, you need to update them prior 
to creating a `SparkSession`.
    +
    +## Upgrading From Spark SQL 2.3 to 2.4
    +
    +  - In Spark version 2.3 and earlier, the second parameter to 
array_contains function is implicitly promoted to the element type of first 
array type parameter. This type promotion can be lossy and may cause 
`array_contains` function to return wrong result. This problem has been 
addressed in 2.4 by employing a safer type promotion mechanism. This can cause 
some change in behavior and are illustrated in the table below.
    +  <table class="table">
    +        <tr>
    +          <th>
    +            <b>Query</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.3 or Prior</b>
    +          </th>
    +          <th>
    +            <b>Result Spark 2.4</b>
    +          </th>
    +          <th>
    +            <b>Remarks</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 1.34D);</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>false</b>
    +          </th>
    +          <th>
    +            <b>In Spark 2.4, left and right parameters are  promoted to 
array(double) and double type respectively.</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), '1');</b>
    +          </th>
    +          <th>
    +            <b>true</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    +          </th>
    +        </tr>
    +        <tr>
    +          <th>
    +            <b>SELECT <br> array_contains(array(1), 'anystring');</b>
    +          </th>
    +          <th>
    +            <b>null</b>
    +          </th>
    +          <th>
    +            <b>AnalysisException is thrown since integer type can not be 
promoted to string type in a loss-less manner.</b>
    +          </th>
    +          <th>
    +            <b>Users can use explict cast</b>
    +          </th>
    +        </tr>
    +  </table>
    +
    +  - Since Spark 2.4, when there is a struct field in front of the IN 
operator before a subquery, the inner query must contain a struct field as 
well. In previous versions, instead, the fields of the struct were compared to 
the output of the inner query. Eg. if `a` is a `struct(a string, b int)`, in 
Spark 2.4 `a in (select (1 as a, 'a' as b) from range(1))` is a valid query, 
while `a in (select 1, 'a' from range(1))` is not. In previous version it was 
the opposite.
    +  - In versions 2.2.1+ and 2.3, if `spark.sql.caseSensitive` is set to 
true, then the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions incorrectly 
became case-sensitive and would resolve to columns (unless typed in lower 
case). In Spark 2.4 this has been fixed and the functions are no longer 
case-sensitive.
    +  - Since Spark 2.4, Spark will evaluate the set operations referenced in 
a query by following a precedence rule as per the SQL standard. If the order is 
not specified by parentheses, set operations are performed from left to right 
with the exception that all INTERSECT operations are performed before any 
UNION, EXCEPT or MINUS operations. The old behaviour of giving equal precedence 
to all the set operations are preserved under a newly added configuration 
`spark.sql.legacy.setopsPrecedence.enabled` with a default value of `false`. 
When this property is set to `true`, spark will evaluate the set operators from 
left to right as they appear in the query given no explicit ordering is 
enforced by usage of parenthesis.
    +  - Since Spark 2.4, Spark will display table description column Last 
Access value as UNKNOWN when the value was Jan 01 1970.
    +  - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
    +  - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization is unable to be used whereas 
`createDataFrame` from Pandas DataFrame allowed the fallback to 
non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas 
DataFrame allow the fallback by default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
    +  - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
    +  - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
    --- End diff --
    
    `an column` -> `a column`



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22746: [SPARK-24499][SQL][DOC] Split the page of sql-pro...

Reply via email to