spark git commit: [SPARK-23355][SQL][DOC][FOLLOWUP] Add migration doc for TBLPROPERTIES

gurwls223 Tue, 08 May 2018 17:40:02 -0700

Repository: spark
Updated Branches:
  refs/heads/master e3de6ab30 -> 9498e528d



[SPARK-23355][SQL][DOC][FOLLOWUP] Add migration doc for TBLPROPERTIES

## What changes were proposed in this pull request?

In Apache Spark 2.4, 
[SPARK-23355](https://issues.apache.org/jira/browse/SPARK-23355) fixes a bug 
that caused table properties to be ignored during convertMetastore for tables 
created with STORED AS ORC/PARQUET.

For Parquet tables that have table properties like TBLPROPERTIES 
(parquet.compression 'NONE'), the property was ignored by default before 
Apache Spark 2.4. After upgrading a cluster, Spark will write uncompressed 
files for such tables, which differs from the behavior of Apache Spark 2.3 
and older.

This PR adds a migration note for that.
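
For tables affected by this change, the stored property can be inspected and adjusted with ordinary Spark SQL statements. The following is a minimal sketch and not part of this patch. It assumes the table `t` from the migration note example, and it assumes `spark.sql.parquet.compression.codec` is left at its default, so that removing the table property falls back to Snappy.

```sql
-- Inspect the properties currently stored on the table.
SHOW TBLPROPERTIES t;

-- Option 1 (assumption: the session-level codec is still the default):
-- drop the property so that writes fall back to the session-level codec.
ALTER TABLE t UNSET TBLPROPERTIES IF EXISTS ('parquet.compression');

-- Option 2: pin the codec explicitly on the table instead.
ALTER TABLE t SET TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```

Only one of the two options is needed. Option 2 keeps the choice of codec visible on the table itself, which may be preferable when session defaults differ between clusters.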

## How was this patch tested?

N/A

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #21269 from dongjoon-hyun/SPARK-23355-DOC.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9498e528
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9498e528
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9498e528

Branch: refs/heads/master
Commit: 9498e528d21e286e496da6ea9bf9c7ad73a7b5bd
Parents: e3de6ab
Author: Dongjoon Hyun <dongj...@apache.org>
Authored: Wed May 9 08:39:46 2018 +0800
Committer: hyukjinkwon <gurwls...@apache.org>
Committed: Wed May 9 08:39:46 2018 +0800

----------------------------------------------------------------------
 docs/sql-programming-guide.md | 2 ++
 1 file changed, 2 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/9498e528/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 075b953a..c521f3c 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1812,6 +1812,8 @@ working with timestamps in `pandas_udf`s to get the best 
performance, see
   - Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
   - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` 
respect the timezone in the input timestamp string, which breaks the assumption 
that the input timestamp is in a specific timezone. Therefore, these 2 
functions can return unexpected results. In version 2.4 and later, this problem 
has been fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if 
the input timestamp string contains timezone. As an example, 
`from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return `2000-10-10 
01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 
00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, will return 
`2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. For people who don't care 
about this problem and want to retain the previous behaivor to keep their query 
unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. 
This option will be removed in Spark 3.0 and should only be used as a temporary 
workaround.
+  - In version 2.3 and earlier, Spark converts Parquet Hive tables by default 
but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. 
This happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 
'NONE')` in case of `spark.sql.hive.convertMetastoreOrc=true`, too. Since Spark 
2.4, Spark respects Parquet/ORC specific table properties while converting 
Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS 
PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy 
parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would 
be uncompressed parquet files.
+
 ## Upgrading From Spark SQL 2.2 to 2.3
 
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
the referenced columns only include the internal corrupt record column (named 
`_corrupt_record` by default). For example, 
`spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()`
 and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. 
Instead, you can cache or save the parsed results and then send the same query. 
For example, `val df = spark.read.schema(schema).json(file).cache()` and then 
`df.filter($"_corrupt_record".isNotNull).count()`.


