Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/22746#discussion_r226237047 --- Diff: docs/sql-data-sources-parquet.md --- @@ -0,0 +1,321 @@ +--- +layout: global +title: Parquet Files +displayTitle: Parquet Files +--- + +* Table of contents +{:toc} + +[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems. +Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema +of the original data. When writing Parquet files, all columns are automatically converted to be nullable for +compatibility reasons. + +### Loading Data Programmatically + +Using the data from the above example: + +<div class="codetabs"> + +<div data-lang="scala" markdown="1"> +{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} +</div> + +<div data-lang="java" markdown="1"> +{% include_example basic_parquet_example java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} +</div> + +<div data-lang="python" markdown="1"> + +{% include_example basic_parquet_example python/sql/datasource.py %} +</div> + +<div data-lang="r" markdown="1"> + +{% include_example basic_parquet_example r/RSparkSQLExample.R %} + +</div> + +<div data-lang="sql" markdown="1"> + +{% highlight sql %} + +CREATE TEMPORARY VIEW parquetTable +USING org.apache.spark.sql.parquet +OPTIONS ( + path "examples/src/main/resources/people.parquet" +) + +SELECT * FROM parquetTable + +{% endhighlight %} + +</div> + +</div> + +### Partition Discovery + +Table partitioning is a common optimization approach used in systems like Hive. In a partitioned +table, data are usually stored in different directories, with partitioning column values encoded in +the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) +are able to discover and infer partitioning information automatically. 
+For example, we can store all our previously used
+population data into a partitioned table using the following directory structure, with two extra
+columns, `gender` and `country`, as partitioning columns:
+
+{% highlight text %}
+
+path
+└── to
+    └── table
+        ├── gender=male
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── gender=female
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+
+{% endhighlight %}
+
+By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
+will automatically extract the partitioning information from the paths.
+Now the schema of the returned DataFrame becomes:
+
+{% highlight text %}
+
+root
+|-- name: string (nullable = true)
+|-- age: long (nullable = true)
+|-- gender: string (nullable = true)
+|-- country: string (nullable = true)
+
+{% endhighlight %}
+
+Notice that the data types of the partitioning columns are automatically inferred. Currently,
+numeric data types, date, timestamp, and string type are supported. Sometimes users may not want
+to automatically infer the data types of the partitioning columns. For these use cases, the
+automatic type inference can be configured by
+`spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to `true`. When type
+inference is disabled, string type will be used for the partitioning columns.
+
+Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
+by default. For the above example, if users pass `path/to/table/gender=male` to either
+`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a
+partitioning column. If users need to specify the base path that partition discovery
+should start with, they can set `basePath` in the data source options.
For example, +when `path/to/table/gender=male` is the path of the data and +users set `basePath` to `path/to/table/`, `gender` will be a partitioning column. + +### Schema Merging + +Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with --- End diff -- `ProtocolBuffer` -> `Protocol Buffers`