Repository: spark
Updated Branches:
  refs/heads/branch-2.3 a5a8a86e2 -> bd83f7ba0

[SPARK-23421][SPARK-22356][SQL] Document the behavior change in SPARK-22356

## What changes were proposed in this pull request?
SPARK-22356 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?
Updated `HiveExternalCatalogVersionsSuite` to verify it.

Author: gatorsmile <>

Closes #20606 from gatorsmile/addMigrationGuide.

(cherry picked from commit a77ebb0921e390cf4fc6279a8c0a92868ad7e69b)
Signed-off-by: gatorsmile <>


Branch: refs/heads/branch-2.3
Commit: bd83f7ba097d9bca9a0e8c072f7566a645887a96
Parents: a5a8a86
Author: gatorsmile <>
Authored: Wed Feb 14 23:52:59 2018 -0800
Committer: gatorsmile <>
Committed: Wed Feb 14 23:53:10 2018 -0800

 docs/                                    | 2 ++
 .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/docs/ b/docs/
index 0f9f01e..cf9529a 100644
--- a/docs/
+++ b/docs/
@@ -1963,6 +1963,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 ## Upgrading From Spark SQL 2.1 to 2.2
   - Spark 2.1.1 introduced a new configuration key: 
`spark.sql.hive.caseSensitiveInferenceMode`. It had a default setting of 
`NEVER_INFER`, which kept behavior identical to 2.1.0. However, Spark 2.2.0 
changes this setting's default value to `INFER_AND_SAVE` to restore 
compatibility with reading Hive metastore tables whose underlying file schemas 
have mixed-case column names. With the `INFER_AND_SAVE` configuration value, on 
first access Spark will perform schema inference on any Hive metastore table 
for which it has not already saved an inferred schema. Note that schema 
inference can be a very time consuming operation for tables with thousands of 
partitions. If compatibility with mixed-case column names is not a concern, you 
can safely set `spark.sql.hive.caseSensitiveInferenceMode` to `NEVER_INFER` to 
avoid the initial overhead of schema inference. Note that with the new default 
`INFER_AND_SAVE` setting, the results of the schema inference are saved as a 
metastore key for future use. Therefore, the initial schema inference occurs only at a table's first access.
+  - Since Spark 2.2.1 and 2.3.0, the schema is always inferred at runtime when data source tables have columns that exist in both the partition schema and the data schema. The inferred schema does not contain the partitioned columns. When reading such a table, Spark respects the partition values of these overlapping columns instead of the values stored in the data source files. In the 2.2.0 and 2.1.x releases, the inferred schema is partitioned, but the data of the table is invisible to users (i.e., the result set is empty).
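The `NEVER_INFER` opt-out described in the 2.1-to-2.2 note above amounts to a single session setting. A minimal sketch, assuming a live `SparkSession` named `spark`:

```scala
// Sketch: skip the one-time schema inference for Hive metastore tables
// when compatibility with mixed-case column names is not a concern.
// Assumes an existing SparkSession `spark`.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
```

This avoids the potentially expensive first-access inference on tables with many partitions, at the cost of mixed-case column-name compatibility.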
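The overlapping-column scenario described above can be sketched as follows. This is a hypothetical illustration, not code from the patch: the path is made up, and it assumes an existing `SparkSession` named `spark`.

```scala
// Sketch of the SPARK-22356 overlap scenario (hypothetical path, assumes
// an existing SparkSession `spark`). Writing files that themselves contain
// column `a` under a partition directory `a=1` makes `a` exist in both the
// data schema and the partition schema.
spark.range(3).selectExpr("id as a", "id as b")
  .write.mode("overwrite").parquet("/tmp/overlap_example/a=1")

val df = spark.read.parquet("/tmp/overlap_example")
// Since Spark 2.2.1 / 2.3.0: the inferred schema drops the duplicate, and
// `a` takes the partition value (1) rather than the values stored in the
// data files. In 2.2.0 and 2.1.x, the result set was empty instead.
df.show()
```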
 ## Upgrading From Spark SQL 2.0 to 2.1
diff --git 
index ae4aeb7..c13a750 100644
@@ -195,7 +195,7 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
 object PROCESS_TABLES extends QueryTest with SQLTestUtils {
   // Tests the latest version of every release line.
-  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0")
+  val testingVersions = Seq("2.0.2", "2.1.2", "2.2.0", "2.2.1")
   protected var spark: SparkSession = _
@@ -249,7 +249,7 @@ object PROCESS_TABLES extends QueryTest with SQLTestUtils {
      // SPARK-22356: overlapped columns between data and partition schema in data source tables
       val tbl_with_col_overlap = s"tbl_with_col_overlap_$index"
-      // For Spark 2.2.0 and 2.1.x, the behavior is different from Spark 2.0.
+      // For Spark 2.2.0 and 2.1.x, the behavior is different from Spark 2.0, 2.2.1, 2.3+
       if (testingVersions(index).startsWith("2.1") || testingVersions(index) == "2.2.0") {
         spark.sql("msck repair table " + tbl_with_col_overlap)
         assert(spark.table(tbl_with_col_overlap).columns === Array("i", "j", 
