GitHub user AndreSchumacher opened a pull request:
https://github.com/apache/spark/pull/360
SPARK-1293 [SQL] WIP Parquet support for nested types
It should be possible to import and export data stored in Parquet's
columnar format that contains nested types. For example:
```
message AddressBook {
  required string owner;
  optional group ownerPhoneNumbers {
    repeated string values;
  }
  repeated group contacts {
    required string name;
    optional string phoneNumber;
  }
}
```
This schema models a type (AddressBook) whose records consist of a required
string (owner), an optional list of strings (ownerPhoneNumbers), and a
repeated group of contacts, each holding a required name and an optional
phoneNumber. The list of tasks is as follows:
<h6>Implement support for converting nested Parquet types to Spark/Catalyst
types:</h6>
- [x] Structs
- [x] Lists
- [ ] Maps
<h6>Implement import (via ``parquetFile``) of nested Parquet types (first
version in this PR)</h6>
- [x] Initial version (without maps)
<h6>Implement export (via ``saveAsParquetFile``)</h6>
- [ ] Initial version (missing)
<h6>Test support for AvroParquet, etc.</h6>
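For illustration only, the nested Parquet schema above could be mirrored by Scala case classes along these lines (the class and value names here are assumed for the sketch and are not code from this PR; `optional` fields map to `Option[_]`, `repeated` fields to `Seq[_]`):

```scala
// Hypothetical Scala model of the AddressBook schema shown above.
case class Contact(name: String, phoneNumber: Option[String])

case class AddressBook(
  owner: String,                          // required string
  ownerPhoneNumbers: Option[Seq[String]], // optional group of repeated strings
  contacts: Seq[Contact])                 // repeated group of (name, phoneNumber)

val book = AddressBook(
  owner = "Some Owner",
  ownerPhoneNumbers = Some(Seq("555 123 4567")),
  contacts = Seq(
    Contact("Alice", Some("555 987 6543")),
    Contact("Bob", None)))
```

The nesting mirrors the schema directly: each `group` becomes a struct-like type, and repetition/optionality are carried in the container type rather than in the field itself.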
Example:
```scala
// Read nested Parquet data and query a nested field via SQL
val data = TestSQLContext
  .parquetFile("input.dir")
  .toSchemaRDD
data.registerAsTable("data")
sql("SELECT owner, contacts[1].name FROM data").collect()
```
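The expression `contacts[1].name` selects the `name` field of one element of the repeated `contacts` group. Whether the SQL indexing is zero- or one-based is determined by the implementation; the sketch below uses Scala's zero-based `Seq` indexing purely to illustrate the shape of the access (the `Contact` class is a hypothetical stand-in, not PR code):

```scala
// Sketch of the nested access the SQL expression performs:
// index into the repeated group, then project a field of the element.
case class Contact(name: String, phoneNumber: Option[String])

val contacts = Seq(Contact("Alice", Some("555-0100")), Contact("Bob", None))
val selected = contacts(1).name // positional access, then field projection
```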
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AndreSchumacher/spark nested_parquet
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/360.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #360
----
commit 7318fe19eac8caf471feec8e4830a538aa647770
Author: Andre Schumacher <[email protected]>
Date: 2014-03-26T07:46:10Z
Adding conversion of nested Parquet schemas
commit 0649f3b407632041df63d3306773b657255dbcb3
Author: Andre Schumacher <[email protected]>
Date: 2014-03-27T16:24:13Z
First commit nested Parquet read converters
commit 341d7e55e7a66e04f0ce45b5b1fa9f8cb7debbeb
Author: Andre Schumacher <[email protected]>
Date: 2014-03-27T17:48:16Z
First working nested Parquet record input
commit 832d263c1056efe04fe353b9718ce3f0ad307c28
Author: Andre Schumacher <[email protected]>
Date: 2014-04-01T13:17:02Z
Completing testcase for nested data (Addressbook(
commit 7f5bd07876aa2a228db67b3bd7d2baa938d1c79c
Author: Andre Schumacher <[email protected]>
Date: 2014-04-01T14:15:23Z
Extending tests for nested Parquet data
commit e9da236fdad31071fddd668dde3b7c303cd08d79
Author: Andre Schumacher <[email protected]>
Date: 2014-04-02T12:42:19Z
Fixing one problem with nested arrays
commit e4375db6d50baf4f629dd71e82b92881841c3b04
Author: Andre Schumacher <[email protected]>
Date: 2014-04-02T14:00:46Z
fixing one problem with nested structs and breaking up files
commit 7c4e79aa61fcbeba0f06c8e40e23a2f486e0cce8
Author: Andre Schumacher <[email protected]>
Date: 2014-04-02T14:45:22Z
added struct converter
commit 04e97d1c355e054c9db51766e2582700f299751e
Author: Andre Schumacher <[email protected]>
Date: 2014-04-03T15:11:40Z
fixing one problem with arrayconverter
commit 0cc0edb93f5b89197697286ec9cc705cd8fd5edf
Author: Andre Schumacher <[email protected]>
Date: 2014-04-04T16:56:56Z
Documenting conversions, bugfix, wrappers of Rows
commit 0fae86af7a463bb2ad04db571c970b85bc6de333
Author: Andre Schumacher <[email protected]>
Date: 2014-04-06T14:19:23Z
Fixing some problems intruduced during rebase
commit 2dc7adc23deb1ba05bef123db08b939a0d386082
Author: Andre Schumacher <[email protected]>
Date: 2014-04-06T16:04:44Z
For primitive rows fall back to more efficient converter, code reorg
commit 8df7d0c1c710bb44fa904165f6b0352732c83468
Author: Andre Schumacher <[email protected]>
Date: 2014-04-08T07:27:26Z
Adding resolution of complex ArrayTypes
commit 79b6a7a1c126e54bbd31614b202e39fc1d882e93
Author: Andre Schumacher <[email protected]>
Date: 2014-04-08T14:55:46Z
Scalastyle
----