Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/5001#discussion_r26354695
--- Diff: docs/sql-programming-guide.md ---
@@ -907,9 +907,132 @@ SELECT * FROM parquetTable
</div>
+### Partition discovery
+
+Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
+table, data are usually stored in different directories, with partitioning column values encoded in
+the path of each partition directory. The Parquet data source is now able to discover and infer
+partitioning information automatically. For example, we can store all our previously used
+population data into a partitioned table using the following directory structure, with two extra
+columns, `sex` and `country`, as partitioning columns:
+
+{% highlight text %}
+
+path
+└── to
+    └── table
+        ├── sex=0
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── sex=1
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+
+{% endhighlight %}
+
+By passing `path/to/table` to either `SQLContext.parquetFile` or `SQLContext.load`, Spark SQL
+will automatically extract the partitioning information from the paths. Now the schema of the
+returned DataFrame becomes:
+
+{% highlight text %}
+
+root
+|-- name: string (nullable = true)
+|-- age: long (nullable = true)
+|-- sex: integer (nullable = true)
+|-- country: string (nullable = true)
+
+{% endhighlight %}
+
+Notice that the data types of the partitioning columns are automatically inferred. Currently,
+numeric data types and string type are supported.
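+
+For example, loading the table above takes a single call (a minimal sketch; the `people` name
+is arbitrary, and the directory layout shown earlier is assumed to exist at `path/to/table`):
+
+{% highlight scala %}
+// sqlContext is an existing SQLContext.
+// Pointing the Parquet data source at the table root is enough; Spark SQL
+// walks the subdirectories and exposes sex and country as regular columns.
+val people = sqlContext.parquetFile("path/to/table")
+people.printSchema() // prints the four-column schema shown above
+{% endhighlight %}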
+
+### Schema merging
+
+Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start
+with a simple schema, and gradually add more columns to the schema as needed. In this way, users
+may end up with multiple Parquet files with different but mutually compatible schemas. The
+Parquet data source is now able to automatically detect this case and merge the schemas of all
+these files.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// sqlContext from the previous example is used in this example.
+// This is used to implicitly convert an RDD to a DataFrame.
+import sqlContext.implicits._
+
+// Create a simple DataFrame, stored into a partition directory
+val df1 = sparkContext.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
+df1.saveAsParquetFile("data/test_table/key=1")
+
+// Create another DataFrame in a new partition directory,
+// adding a new column and dropping an existing column
+val df2 = sparkContext.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
+df2.saveAsParquetFile("data/test_table/key=2")
+
+// Read the partitioned table
+val df3 = sqlContext.parquetFile("data/test_table")
+df3.printSchema()
+
+// The final schema consists of all 3 columns in the Parquet files together
+// with the partitioning column appearing in the partition directory paths.
+// root
+// |-- single: int (nullable = true)
+// |-- double: int (nullable = true)
+// |-- triple: int (nullable = true)
+// |-- key: int (nullable = true)
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext from the previous example is used in this example.
+from pyspark.sql import Row
+
+# Create a simple DataFrame, stored into a partition directory
+df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))
+                                   .map(lambda i: Row(single=i, double=i * 2)))
+df1.save("data/test_table/key=1", "parquet")
+
+# Create another DataFrame in a new partition directory,
+# adding a new column and dropping an existing column
+df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
+                                   .map(lambda i: Row(single=i, triple=i * 3)))
+df2.save("data/test_table/key=2", "parquet")
+
+# Read the partitioned table
+df3 = sqlContext.parquetFile("data/test_table")
+df3.printSchema()
+
+# The final schema consists of all 3 columns in the Parquet files together
+# with the partitioning column appearing in the partition directory paths.
+# root
+# |-- single: int (nullable = true)
+# |-- double: int (nullable = true)
+# |-- triple: int (nullable = true)
+# |-- key: int (nullable = true)
+{% endhighlight %}
--- End diff --
I didn't include a Java version here because it turned out to be quite verbose and
might not be that helpful for illustrating this feature. I can add one if it's
considered necessary.