[GitHub] spark pull request: [SPARK-7447][SQL] Don't re-merge Parquet schem...

viirya Fri, 08 May 2015 10:27:18 -0700

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/6012


    [SPARK-7447][SQL] Don't re-merge Parquet schema when the relation is deserialized

    JIRA: https://issues.apache.org/jira/browse/SPARK-7447
    
    `MetadataCache` in `ParquetRelation2` is annotated as `@transient`. When
`ParquetRelation2` is deserialized, we ask `MetadataCache` to refresh and
perform schema merging again. This is time-consuming, especially when there
are many Parquet files.
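
To illustrate the problem described above, here is a minimal, self-contained sketch in plain Java (whose `transient` keyword is the counterpart of Scala's `@transient`). It is not Spark code and not necessarily how this pull request fixes the issue. The names `RemergeSketch`, `RemergingRelation`, `CachingRelation`, `mergeSchemas`, and `countMerges` are hypothetical, and "schema merging" is simulated as a union of column names. The sketch only shows that a transient cache is dropped on serialization, so the merge runs again on the deserialized copy, whereas a field that is serialized along with the object is not recomputed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class RemergeSketch {
    // Counts how many times the simulated (expensive) merge runs.
    static final AtomicInteger MERGE_COUNT = new AtomicInteger();

    // Hypothetical stand-in for schema merging: union of column names.
    static List<String> mergeSchemas(List<List<String>> footers) {
        MERGE_COUNT.incrementAndGet();
        Set<String> columns = new LinkedHashSet<>();
        for (List<String> footer : footers) {
            columns.addAll(footer);
        }
        return new ArrayList<>(columns);
    }

    // The merged schema is transient: it is dropped on serialization,
    // so the deserialized copy has to merge again.
    static class RemergingRelation implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ArrayList<List<String>> footers;
        private transient List<String> schema;

        RemergingRelation(List<List<String>> footers) {
            this.footers = new ArrayList<>(footers);
        }

        List<String> schema() {
            if (schema == null) {
                schema = mergeSchemas(footers);
            }
            return schema;
        }
    }

    // The merged schema is a regular field: it travels with the
    // serialized object, so the copy does not merge again.
    static class CachingRelation implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ArrayList<List<String>> footers;
        private List<String> schema;

        CachingRelation(List<List<String>> footers) {
            this.footers = new ArrayList<>(footers);
        }

        List<String> schema() {
            if (schema == null) {
                schema = mergeSchemas(footers);
            }
            return schema;
        }
    }

    // Serializes and deserializes a value in memory.
    static <T extends Serializable> T roundTrip(T value)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    // Merges once, round-trips the relation, asks the copy for its schema,
    // and returns the total number of merges that ran.
    static int countMerges(boolean keepSchemaWhenSerialized) throws Exception {
        MERGE_COUNT.set(0);
        List<List<String>> footers = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("id", "age"));
        if (keepSchemaWhenSerialized) {
            CachingRelation relation = new CachingRelation(footers);
            relation.schema();
            roundTrip(relation).schema();
        } else {
            RemergingRelation relation = new RemergingRelation(footers);
            relation.schema();
            roundTrip(relation).schema();
        }
        return MERGE_COUNT.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("transient cache:   " + countMerges(false) + " merges");
        System.out.println("serialized schema: " + countMerges(true) + " merges");
    }
}
```

In this sketch the transient variant merges twice (once before serialization, once on the deserialized copy) and the other variant merges once. The trade-off is that the serialized object becomes larger, because the merged schema is shipped with it.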

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 without_remerge_schema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6012.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6012
    
----
commit b0fc09bb7164ce94c52da4d50415fd6616d790b6
Author: Liang-Chi Hsieh <[email protected]>
Date:   2015-05-08T17:15:37Z

    Don't generate and merge parquetSchema multiple times.

----


