GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/4308
[SPARK-5182] [SPARK-5528] [SQL] WIP: Parquet data source improvements
This PR adds three major improvements to Parquet data source:
1. Partition discovery
When reading Parquet files that reside in Hive-style partitioned
directories, `ParquetRelation2` automatically discovers partitioning
information and infers partition column types.
This is also partial work towards [SPARK-5182] [1], which aims to provide
first-class partitioning support for the data source API. Related code in this
PR can be easily extracted to the data source API level in future versions.
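To make the partition discovery idea above concrete, here is a minimal, self-contained sketch of parsing Hive-style partition paths and inferring column types. The names (`PartitionDiscoverySketch`, `parsePartitionPath`, `inferType`, the `InferredType` hierarchy) are illustrative assumptions, not the actual identifiers used in `ParquetRelation2`:

```scala
// Hypothetical sketch: discover partition columns from Hive-style paths
// such as /table/year=2015/month=02/part-00000.parquet.
object PartitionDiscoverySketch {
  sealed trait InferredType
  case object IntType extends InferredType
  case object DoubleType extends InferredType
  case object StringType extends InferredType

  // Infer the narrowest type that can represent a raw partition value.
  def inferType(raw: String): InferredType =
    if (raw.nonEmpty && raw.forall(_.isDigit)) IntType
    else if (scala.util.Try(raw.toDouble).isSuccess) DoubleType
    else StringType

  // Split a path into segments and keep only "column=value" segments,
  // producing (column, rawValue, inferredType) triples.
  def parsePartitionPath(path: String): Seq[(String, String, InferredType)] =
    path.split('/').toSeq.collect {
      case segment if segment.contains('=') =>
        val Array(col, value) = segment.split("=", 2)
        (col, value, inferType(value))
    }
}
```

In the real implementation the inferred types would be Catalyst data types and inference would reconcile values across all leaf directories; this sketch only shows the per-path parsing step.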
1. Schema merging
When enabled, Parquet data source collects schema information from all
Parquet part-files and tries to merge them. Exceptions are thrown when
incompatible schemas are detected. This feature is controlled by data source
option `parquet.mergeSchema`, and is enabled by default.
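The merging behavior described above can be sketched as follows. This is a simplified model under loose assumptions (a schema reduced to a name-to-type map, plain type-name equality as the compatibility check); the names are hypothetical and the real code operates on Parquet/Catalyst schemas with richer compatibility rules:

```scala
// Hypothetical sketch: each part-file contributes a schema; merging
// unions the fields and throws when the same column appears with
// incompatible types, mirroring the behavior described above.
object SchemaMergeSketch {
  type Schema = Map[String, String] // column name -> type name

  def merge(left: Schema, right: Schema): Schema = {
    val conflicts =
      (left.keySet intersect right.keySet).filter(k => left(k) != right(k))
    if (conflicts.nonEmpty)
      throw new IllegalArgumentException(
        s"Incompatible schemas for columns: ${conflicts.mkString(", ")}")
    left ++ right
  }
}
```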
1. Metastore Parquet table conversion moved to analysis phase
This greatly simplifies the conversion logic. `ParquetConversion`
strategy can be removed once the old Parquet implementation is removed in the
future.
This version of Parquet data source aims to entirely replace the old
Parquet implementation. However, the old version hasn't been removed yet.
Users can fall back to the old version by turning off SQL configuration
`spark.sql.parquet.useDataSourceApi`.
Other JIRA tickets fixed as side effects in this PR:
- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare
binary types.
- [SPARK-3575] [4]: Metastore schema is now preserved and passed to
`ParquetRelation2` via data source option `parquet.metastoreSchema`.
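On the SPARK-5509 point above: binary values surface as byte arrays on the JVM, where `==` is reference equality and no natural ordering exists, so comparisons need an explicit `Ordering`. The following self-contained sketch shows one way to write such an ordering; it illustrates the idea only and is not Spark's actual `EqualTo` code:

```scala
// Hypothetical sketch: an element-wise, unsigned lexicographic Ordering
// for Array[Byte], the kind of comparator needed to compare binary
// values correctly instead of relying on reference equality.
object BinaryCompareSketch {
  val byteArrayOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
    def compare(x: Array[Byte], y: Array[Byte]): Int = {
      val len = math.min(x.length, y.length)
      var i = 0
      while (i < len) {
        val c = (x(i) & 0xff) - (y(i) & 0xff) // compare as unsigned bytes
        if (c != 0) return c
        i += 1
      }
      x.length - y.length // shorter prefix sorts first
    }
  }
}
```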
TODO:
- [ ] More test cases for partition discovery
- [ ] Fix write path after data source write support (#4294) is merged
It turned out to be non-trivial to fall back to the old Parquet
implementation on the write path when the Parquet data source is enabled.
Since we're planning to include data source write support in 1.3.0, I simply
ignored two test cases involving Parquet insertion for now.
- [ ] Fix outdated comments and documentations
PS: More than half of the changed lines in this PR are trivial changes to
test cases. To test Parquet with and without the new data source, almost all
Parquet test cases are moved into wrapper driver functions, which introduces
hundreds of lines of changes.
[1]: https://issues.apache.org/jira/browse/SPARK-5182
[2]: https://issues.apache.org/jira/browse/SPARK-5528
[3]: https://issues.apache.org/jira/browse/SPARK-5509
[4]: https://issues.apache.org/jira/browse/SPARK-3575
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark parquet-partition-discovery
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4308.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4308
----
commit c0f220f76b15eafecfa14cc7021a2472384c8d14
Author: Cheng Lian <[email protected]>
Date: 2015-01-30T01:41:18Z
Draft version of Parquet partition discovery and schema merging
commit 5c405a8f3f8c511454e268379ec2348bdcb8902e
Author: Cheng Lian <[email protected]>
Date: 2015-02-01T00:23:27Z
Fixes all existing Parquet test suites except for ParquetMetastoreSuite
commit 5a5e18ed2e213904525375643ef7a2e1e34a590e
Author: Cheng Lian <[email protected]>
Date: 2015-02-02T04:34:09Z
Fixes Metastore Parquet table conversion
commit af3683ea68d3efe7c0368cb8d23fdd661fbfeffc
Author: Cheng Lian <[email protected]>
Date: 2015-02-02T11:30:01Z
Uses switch to control whether use Parquet data source or not
----