Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22746#discussion_r226237047
--- Diff: docs/sql-data-sources-parquet.md ---
@@ -0,0 +1,321 @@
+---
+layout: global
+title: Parquet Files
+displayTitle: Parquet Files
+---
+
+* Table of contents
+{:toc}
+
+[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
+Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
+of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
+compatibility reasons.
+
+### Loading Data Programmatically
+
+Using the data from the above example:
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% include_example basic_parquet_example python/sql/datasource.py %}
+</div>
+
+<div data-lang="r" markdown="1">
+
+{% include_example basic_parquet_example r/RSparkSQLExample.R %}
+
+</div>
+
+<div data-lang="sql" markdown="1">
+
+{% highlight sql %}
+
+CREATE TEMPORARY VIEW parquetTable
+USING org.apache.spark.sql.parquet
+OPTIONS (
+ path "examples/src/main/resources/people.parquet"
+)
+
+SELECT * FROM parquetTable
+
+{% endhighlight %}
+
+</div>
+
+</div>
+
+### Partition Discovery
+
+Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
+table, data are usually stored in different directories, with partitioning column values encoded in
+the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet)
+are able to discover and infer partitioning information automatically.
+For example, we can store all our previously used
+population data into a partitioned table using the following directory structure, with two extra
+columns, `gender` and `country`, as partitioning columns:
+
+{% highlight text %}
+
+path
+└── to
+    └── table
+        ├── gender=male
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── gender=female
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+
+{% endhighlight %}
+
+By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
+will automatically extract the partitioning information from the paths.
+Now the schema of the returned DataFrame becomes:
+
+{% highlight text %}
+
+root
+|-- name: string (nullable = true)
+|-- age: long (nullable = true)
+|-- gender: string (nullable = true)
+|-- country: string (nullable = true)
+
+{% endhighlight %}
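+
+For example, the following snippet reads the table root and prints the inferred schema (a minimal sketch, assuming a `spark` `SparkSession` as in the examples above; the variable name is illustrative):
+
+{% highlight scala %}
+// Spark discovers gender and country as partitioning columns
+// from the directory names under path/to/table.
+val peopleDF = spark.read.parquet("path/to/table")
+peopleDF.printSchema()
+{% endhighlight %}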
+
+Notice that the data types of the partitioning columns are automatically inferred. Currently,
+numeric data types, date, timestamp and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
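+
+For example, to read the partitioning columns as plain strings (a minimal sketch using the runtime configuration API; only the configuration key comes from the text above):
+
+{% highlight scala %}
+// Disable type inference: all partitioning columns are read as string.
+spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
+val df = spark.read.parquet("path/to/table")
+{% endhighlight %}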
+
+Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
+by default. For the above example, if users pass `path/to/table/gender=male` to either
+`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered a
+partitioning column. If users need to specify the base path that partition discovery
+should start with, they can set `basePath` in the data source options. For example,
+when `path/to/table/gender=male` is the path of the data and
+users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
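+
+For example (a minimal sketch; only the `basePath` option name and the paths come from the text above):
+
+{% highlight scala %}
+// Reads only the gender=male subtree, yet keeps gender as a partitioning
+// column because basePath points at the table root.
+val malesDF = spark.read
+  .option("basePath", "path/to/table/")
+  .parquet("path/to/table/gender=male")
+{% endhighlight %}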
+
+### Schema Merging
+
+Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
--- End diff ---
`ProtocolBuffer` -> `Protocol Buffers`