Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/5001#discussion_r26354695
--- Diff: docs/sql-programming-guide.md ---
@@ -907,9 +907,132 @@ SELECT * FROM parquetTable
</div>
+### Partition discovery
+
+Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
+table, data are usually stored in different directories, with partitioning column values encoded in
+the path of each partition directory. The Parquet data source is now able to discover and infer
+partitioning information automatically. For example, we can store all our previously used
+population data into a partitioned table using the following directory structure, with two extra
+columns, `sex` and `country`, as partitioning columns:
+
+{% highlight text %}
+
+path
+└── to
+    └── table
+        ├── sex=0
+        │   ├── ...
+        │   │
+        │   ├── country=US
+        │   │   └── data.parquet
+        │   ├── country=CN
+        │   │   └── data.parquet
+        │   └── ...
+        └── sex=1
+            ├── ...
+            │
+            ├── country=US
+            │   └── data.parquet
+            ├── country=CN
+            │   └── data.parquet
+            └── ...
+
+{% endhighlight %}
+
+By passing `path/to/table` to either `SQLContext.parquetFile` or `SQLContext.load`, Spark SQL
+will automatically extract the partitioning information from the paths. Now the schema of the
+returned DataFrame becomes:
+
+{% highlight text %}
+
+root
+|-- name: string (nullable = true)
+|-- age: long (nullable = true)
+|-- sex: integer (nullable = true)
+|-- country: string (nullable = true)
+
+{% endhighlight %}
+
+Notice that the data types of the partitioning columns are automatically inferred. Currently,
+numeric data types and string type are supported.
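+
+For example, loading the table above takes a single call (a minimal sketch; the `people` name
+is arbitrary, and the directory layout shown earlier is assumed to exist at `path/to/table`):
+
+{% highlight scala %}
+// sqlContext is an existing SQLContext.
+// Pointing the Parquet data source at the table root is enough; Spark SQL
+// walks the subdirectories and exposes sex and country as regular columns.
+val people = sqlContext.parquetFile("path/to/table")
+people.printSchema() // prints the four-column schema shown above
+{% endhighlight %}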
+
+### Schema merging
+
+Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start
+with a simple schema, and gradually add more columns to the schema as needed. In this way, users
+may end up with multiple Parquet files with different but mutually compatible schemas. The
+Parquet data source is now able to automatically detect this case and merge the schemas of all
+these files.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+// sqlContext from the previous example is used in this example.
+// This is used to implicitly convert an RDD to a DataFrame.
+import sqlContext.implicits._
+
+// Create a simple DataFrame, stored into a partition directory
+val df1 = sparkContext.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
+df1.saveAsParquetFile("data/test_table/key=1")
+
+// Create another DataFrame in a new partition directory,
+// adding a new column and dropping an existing column
+val df2 = sparkContext.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
+df2.saveAsParquetFile("data/test_table/key=2")
+
+// Read the partitioned table
+val df3 = sqlContext.parquetFile("data/test_table")
+df3.printSchema()
+
+// The final schema consists of all 3 columns in the Parquet files together
+// with the partitioning column appearing in the partition directory paths.
+// root
+// |-- single: int (nullable = true)
+// |-- double: int (nullable = true)
+// |-- triple: int (nullable = true)
+// |-- key: int (nullable = true)
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+# sqlContext from the previous example is used in this example.
+from pyspark.sql import Row
+
+# Create a simple DataFrame, stored into a partition directory
+df1 = sqlContext.createDataFrame(sc.parallelize(range(1, 6))
+                                   .map(lambda i: Row(single=i, double=i * 2)))
+df1.save("data/test_table/key=1", "parquet")
+
+# Create another DataFrame in a new partition directory,
+# adding a new column and dropping an existing column
+df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
+                                   .map(lambda i: Row(single=i, triple=i * 3)))
+df2.save("data/test_table/key=2", "parquet")
+
+# Read the partitioned table
+df3 = sqlContext.parquetFile("data/test_table")
+df3.printSchema()
+
+# The final schema consists of all 3 columns in the Parquet files together
+# with the partitioning column appearing in the partition directory paths.
+# root
+# |-- single: int (nullable = true)
+# |-- double: int (nullable = true)
+# |-- triple: int (nullable = true)
+# |-- key: int (nullable = true)
+{% endhighlight %}
--- End diff --
I didn't include a Java version here because it turned out to be quite verbose and
might not be that helpful for illustrating this feature. I can add one if it's
considered necessary.