spark git commit: [SPARK-9618] [SQL] Use the specified schema when reading Parquet files

rxin Thu, 06 Aug 2015 13:19:53 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-1.5 8b00c0690 -> d5f788121



[SPARK-9618] [SQL] Use the specified schema when reading Parquet files

The user-specified schema is currently ignored when loading Parquet files through the `parquet` method.

One workaround is to use the `format` and `load` methods instead of `parquet`, 
e.g.:

```
val schema = ???

// schema is ignored
sqlContext.read.schema(schema).parquet("hdfs:///test")

// schema is retained
sqlContext.read.schema(schema).format("parquet").load("hdfs:///test")
```
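
To make the difference observable, the following is a minimal sketch that compares the schema handed to the reader with the schema of the DataFrame each call returns. The helper name `checkSchemaIsHonoured`, the column names and the path argument are hypothetical, and the comparison is only informative if the files' own schema differs from the one passed in:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical helper, not part of this patch: reports whether each loading
// style keeps the schema that was passed to the reader.
def checkSchemaIsHonoured(sqlContext: SQLContext, path: String): Unit = {
  // Illustrative schema; the column names and types are assumptions.
  val schema = StructType(Seq(
    StructField("id", LongType, nullable = true),
    StructField("name", StringType, nullable = true)))

  // Shortcut method, the call affected by this issue.
  val viaParquet: DataFrame = sqlContext.read.schema(schema).parquet(path)
  // Generic path, which already passes the schema through.
  val viaLoad: DataFrame = sqlContext.read.schema(schema).format("parquet").load(path)

  println(s"parquet():     schema kept = ${viaParquet.schema == schema}")
  println(s"format/load(): schema kept = ${viaLoad.schema == schema}")
}
```

Called as `checkSchemaIsHonoured(sqlContext, "hdfs:///test")`, both lines would be expected to agree once the schema is passed through to the Parquet relation.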

The fix is simple, but I wonder if the `parquet` method should instead be 
written in a similar fashion to `orc`:

```
def parquet(path: String): DataFrame = format("parquet").load(path)
```

Author: Nathan Howell <[email protected]>

Closes #7947 from NathanHowell/SPARK-9618 and squashes the following commits:

d1ea62c [Nathan Howell] [SPARK-9618] [SQL] Use the specified schema when 
reading Parquet files

(cherry picked from commit eb8bfa3eaa0846d685e4d12f9ee2e4273b85edcf)
Signed-off-by: Reynold Xin <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d5f78812
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d5f78812
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d5f78812

Branch: refs/heads/branch-1.5
Commit: d5f788121bebc3266e961d2e9042fe9a4049c8a4
Parents: 8b00c06
Author: Nathan Howell <[email protected]>
Authored: Wed Aug 5 22:16:56 2015 +0800
Committer: Reynold Xin <[email protected]>
Committed: Thu Aug 6 13:19:17 2015 -0700

----------------------------------------------------------------------
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/d5f78812/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index eb09807..b90de8e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -260,7 +260,7 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
 
       sqlContext.baseRelationToDataFrame(
         new ParquetRelation(
-          globbedPaths.map(_.toString), None, None, extraOptions.toMap)(sqlContext))
+          globbedPaths.map(_.toString), userSpecifiedSchema, None, extraOptions.toMap)(sqlContext))
     }
   }
 


