GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/4308
[SPARK-5182] [SPARK-5528] [SQL] WIP: Parquet data source improvements
This PR adds three major improvements to Parquet data source:
1. Partition discovery
When reading Parquet files that reside in Hive-style partitioned
directories, `ParquetRelation2` automatically discovers partitioning
information and infers partition column types.
This is also partial work towards [SPARK-5182] [1], which aims to provide
first-class partitioning support for the data source API. Related code in this
PR can be easily extracted to the data source API level in future versions.
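To make the partition discovery idea above concrete, here is a minimal, self-contained sketch of parsing Hive-style partition paths and inferring column types. The names (`PartitionDiscoverySketch`, `parsePartitionPath`, `inferType`, the `InferredType` hierarchy) are illustrative assumptions, not the actual identifiers used in `ParquetRelation2`:

```scala
// Hypothetical sketch: discover partition columns from Hive-style paths
// such as /table/year=2015/month=02/part-00000.parquet.
object PartitionDiscoverySketch {
  sealed trait InferredType
  case object IntType extends InferredType
  case object DoubleType extends InferredType
  case object StringType extends InferredType

  // Infer the narrowest type that can represent a raw partition value.
  def inferType(raw: String): InferredType =
    if (raw.nonEmpty && raw.forall(_.isDigit)) IntType
    else if (scala.util.Try(raw.toDouble).isSuccess) DoubleType
    else StringType

  // Split a path into segments and keep only "column=value" segments,
  // producing (column, rawValue, inferredType) triples.
  def parsePartitionPath(path: String): Seq[(String, String, InferredType)] =
    path.split('/').toSeq.collect {
      case segment if segment.contains('=') =>
        val Array(col, value) = segment.split("=", 2)
        (col, value, inferType(value))
    }
}
```

In the real implementation the inferred types would be Catalyst data types and inference would reconcile values across all leaf directories; this sketch only shows the per-path parsing step.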
1. Schema merging
When enabled, Parquet data source collects schema information from all
Parquet part-files and tries to merge them. Exceptions are thrown when
incompatible schemas are detected. This feature is controlled by data source
option `parquet.mergeSchema`, and is enabled by default.
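The merging behavior described above can be sketched as follows. This is a simplified model under loose assumptions (a schema reduced to a name-to-type map, plain type-name equality as the compatibility check); the names are hypothetical and the real code operates on Parquet/Catalyst schemas with richer compatibility rules:

```scala
// Hypothetical sketch: each part-file contributes a schema; merging
// unions the fields and throws when the same column appears with
// incompatible types, mirroring the behavior described above.
object SchemaMergeSketch {
  type Schema = Map[String, String] // column name -> type name

  def merge(left: Schema, right: Schema): Schema = {
    val conflicts =
      (left.keySet intersect right.keySet).filter(k => left(k) != right(k))
    if (conflicts.nonEmpty)
      throw new IllegalArgumentException(
        s"Incompatible schemas for columns: ${conflicts.mkString(", ")}")
    left ++ right
  }
}
```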
1. Metastore Parquet table conversion moved to analysis phase
This greatly simplifies the conversion logic. `ParquetConversion`
strategy can be removed once the old Parquet implementation is removed in the
future.
This version of Parquet data source aims to entirely replace the old
Parquet implementation. However, the old version hasn't been removed yet.
Users can fall back to the old version by turning off SQL configuration
`spark.sql.parquet.useDataSourceApi`.
Other JIRA tickets fixed as side effects in this PR:
- [SPARK-5509] [3]: `EqualTo` now uses a proper `Ordering` to compare
binary types.
- [SPARK-3575] [4]: Metastore schema is now preserved and passed to
`ParquetRelation2` via data source option `parquet.metastoreSchema`.
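On the SPARK-5509 point above: binary values surface as byte arrays on the JVM, where `==` is reference equality and no natural ordering exists, so comparisons need an explicit `Ordering`. The following self-contained sketch shows one way to write such an ordering; it illustrates the idea only and is not Spark's actual `EqualTo` code:

```scala
// Hypothetical sketch: an element-wise, unsigned lexicographic Ordering
// for Array[Byte], the kind of comparator needed to compare binary
// values correctly instead of relying on reference equality.
object BinaryCompareSketch {
  val byteArrayOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
    def compare(x: Array[Byte], y: Array[Byte]): Int = {
      val len = math.min(x.length, y.length)
      var i = 0
      while (i < len) {
        val c = (x(i) & 0xff) - (y(i) & 0xff) // compare as unsigned bytes
        if (c != 0) return c
        i += 1
      }
      x.length - y.length // shorter prefix sorts first
    }
  }
}
```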
TODO:
- [ ] More test cases for partition discovery
- [ ] Fix write path after data source write support (#4294) is merged
It turned out to be non-trivial to fall back to the old Parquet
implementation on the write path when the Parquet data source is enabled.
Since we're planning to include data source write support in 1.3.0, I simply
ignored two test cases involving Parquet insertion for now.
- [ ] Fix outdated comments and documentations
PS: More than half of the changed lines in this PR are trivial changes to
test cases. To test Parquet with and without the new data source, almost all
Parquet test cases are moved into wrapper driver functions, which introduces
hundreds of lines of changes.
[1]: https://issues.apache.org/jira/browse/SPARK-5182
[2]: https://issues.apache.org/jira/browse/SPARK-5528
[3]: https://issues.apache.org/jira/browse/SPARK-5509
[4]: https://issues.apache.org/jira/browse/SPARK-3575
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark parquet-partition-discovery
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4308.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4308
----
commit c0f220f76b15eafecfa14cc7021a2472384c8d14
Author: Cheng Lian <[email protected]>
Date: 2015-01-30T01:41:18Z
Draft version of Parquet partition discovery and schema merging
commit 5c405a8f3f8c511454e268379ec2348bdcb8902e
Author: Cheng Lian <[email protected]>
Date: 2015-02-01T00:23:27Z
Fixes all existing Parquet test suites except for ParquetMetastoreSuite
commit 5a5e18ed2e213904525375643ef7a2e1e34a590e
Author: Cheng Lian <[email protected]>
Date: 2015-02-02T04:34:09Z
Fixes Metastore Parquet table conversion
commit af3683ea68d3efe7c0368cb8d23fdd661fbfeffc
Author: Cheng Lian <[email protected]>
Date: 2015-02-02T11:30:01Z
Uses switch to control whether use Parquet data source or not
----