spark git commit: [SPARK-22267][SQL][TEST] Spark SQL incorrectly reads ORC files when column order is different

wenchen Mon, 11 Dec 2017 05:54:29 -0800

Repository: spark
Updated Branches:
  refs/heads/master ec873a4fd -> 6cc7021a4



[SPARK-22267][SQL][TEST] Spark SQL incorrectly reads ORC files when column 
order is different

## What changes were proposed in this pull request?

Until 2.2.1, with the default configuration, Apache Spark returns incorrect
results when the column order of the ORC file schema differs from that of the
metastore schema. This is due to the Hive 1.2.1 library and some issues with
the `convertMetastoreOrc` option.

```scala
scala> Seq(1 -> 2).toDF("c1", "c2").write.format("orc").mode("overwrite").save("/tmp/o")
scala> sql("CREATE EXTERNAL TABLE o(c2 INT, c1 INT) STORED AS orc LOCATION '/tmp/o'")
scala> spark.table("o").show    // This is wrong.
+---+---+
| c2| c1|
+---+---+
|  1|  2|
+---+---+
scala> spark.read.orc("/tmp/o").show  // This is correct.
+---+---+
| c1| c2|
+---+---+
|  1|  2|
+---+---+
```
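
The wrong output above is consistent with file columns being matched to table columns by position rather than by name. The following is a minimal sketch of that difference using plain Scala collections; it is a toy model, not Spark's or Hive's ORC reader code, and `ColumnOrderSketch`, `readByPosition`, and `readByName` are hypothetical names introduced only for illustration.

```scala
// Toy model only: plain Scala collections, not the actual ORC reader.
object ColumnOrderSketch {
  // Physical layout of the file in the example: written as (c1, c2) = (1, 2).
  val fileColumns: Seq[String] = Seq("c1", "c2")
  val fileRow: Seq[Int] = Seq(1, 2)

  // The metastore declares the table columns in the opposite order.
  val tableColumns: Seq[String] = Seq("c2", "c1")

  // Positional matching: the i-th table column receives the i-th file value.
  def readByPosition(): Seq[(String, Int)] = tableColumns.zip(fileRow)

  // Name-based matching: each table column looks up its value by column name.
  def readByName(): Seq[(String, Int)] = {
    val valueByName = fileColumns.zip(fileRow).toMap
    tableColumns.map(name => name -> valueByName(name))
  }

  def main(args: Array[String]): Unit = {
    println(readByPosition()) // List((c2,1), (c1,2)), the shape of the wrong result
    println(readByName())     // List((c2,2), (c1,1)), the expected result
  }
}
```

Under this model, name-based matching gives `c2 = 2, c1 = 1`, which corresponds to the `Row(2, 1)` that the new test expects from the table.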

After [SPARK-22279](https://github.com/apache/spark/pull/19499), the default
configuration no longer has this bug. Although the Hive 1.2.1 library code path
still has the problem, we should add test coverage for the current behavior
to prevent future regressions.

## How was this patch tested?

Pass the Jenkins with a newly added test.

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #19928 from dongjoon-hyun/SPARK-22267.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6cc7021a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6cc7021a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6cc7021a

Branch: refs/heads/master
Commit: 6cc7021a40b64c41a51f337ec4be9545a25e838c
Parents: ec873a4
Author: Dongjoon Hyun <dongj...@apache.org>
Authored: Mon Dec 11 21:52:57 2017 +0800
Committer: Wenchen Fan <wenc...@databricks.com>
Committed: Mon Dec 11 21:52:57 2017 +0800

----------------------------------------------------------------------
 .../spark/sql/hive/execution/SQLQuerySuite.scala | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/6cc7021a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
----------------------------------------------------------------------
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
index c11e37a..f2562c3 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
@@ -2153,4 +2153,23 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
       }
     }
   }
+
+  test("SPARK-22267 Spark SQL incorrectly reads ORC files when column order is different") {
+    Seq("native", "hive").foreach { orcImpl =>
+      withSQLConf(SQLConf.ORC_IMPLEMENTATION.key -> orcImpl) {
+        withTempPath { f =>
+          val path = f.getCanonicalPath
+          Seq(1 -> 2).toDF("c1", "c2").write.orc(path)
+          checkAnswer(spark.read.orc(path), Row(1, 2))
+
+          withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") { // default since 2.3.0
+            withTable("t") {
+              sql(s"CREATE EXTERNAL TABLE t(c2 INT, c1 INT) STORED AS ORC LOCATION '$path'")
+              checkAnswer(spark.table("t"), Row(2, 1))
+            }
+          }
+        }
+      }
+    }
+  }
 }

