spark git commit: [SPARK-9618] [SQL] Use the specified schema when reading Parquet files

lian Wed, 05 Aug 2015 07:17:55 -0700

Repository: spark
Updated Branches:
  refs/heads/master 70112ff22 -> eb8bfa3ea



[SPARK-9618] [SQL] Use the specified schema when reading Parquet files

A user-specified schema is currently ignored when loading Parquet files through the `parquet` method.

One workaround is to use the `format` and `load` methods instead of `parquet`, 
e.g.:

```
val schema = ???

// schema is ignored
sqlContext.read.schema(schema).parquet("hdfs:///test")

// schema is retained
sqlContext.read.schema(schema).format("parquet").load("hdfs:///test")
```
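
For illustration, the `???` placeholder above could be filled in with an explicit `StructType`. This is a minimal sketch, not part of the patch: the column names `id` and `name`, the object name `ExplicitSchemaRead`, and the helper `readWithSchema` are hypothetical, and it assumes the Spark SQL 1.4-era `DataFrameReader` API used in the snippet above.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ExplicitSchemaRead {
  // Hypothetical schema; real column names and types depend on the data set.
  val schema: StructType = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // Uses the format/load path, which passes the supplied schema through
  // to the Parquet relation even without this fix.
  def readWithSchema(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read.schema(schema).format("parquet").load(path)
}
```

With the fix applied, `sqlContext.read.schema(ExplicitSchemaRead.schema).parquet(path)` should hand the same schema to the Parquet relation.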

The fix is simple, but I wonder if the `parquet` method should instead be 
written in a similar fashion to `orc`:

```
def parquet(path: String): DataFrame = format("parquet").load(path)
```

Author: Nathan Howell <[email protected]>

Closes #7947 from NathanHowell/SPARK-9618 and squashes the following commits:

d1ea62c [Nathan Howell] [SPARK-9618] [SQL] Use the specified schema when reading Parquet files


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb8bfa3e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb8bfa3e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb8bfa3e

Branch: refs/heads/master
Commit: eb8bfa3eaa0846d685e4d12f9ee2e4273b85edcf
Parents: 70112ff
Author: Nathan Howell <[email protected]>
Authored: Wed Aug 5 22:16:56 2015 +0800
Committer: Cheng Lian <[email protected]>
Committed: Wed Aug 5 22:16:56 2015 +0800

----------------------------------------------------------------------
 sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/eb8bfa3e/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index eb09807..b90de8e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -260,7 +260,7 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
 
       sqlContext.baseRelationToDataFrame(
         new ParquetRelation(
-          globbedPaths.map(_.toString), None, None, extraOptions.toMap)(sqlContext))
+          globbedPaths.map(_.toString), userSpecifiedSchema, None, extraOptions.toMap)(sqlContext))
     }
   }
 


