Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/14649
@andreweduffy Thanks for the explanations! This makes much
more sense to me now.
Although `_metadata` can be neat for the read path, it's a troublemaker
for the write path:
1. Writing summary files (either `_metadata` or `_common_metadata`) can be
quite expensive when writing a large Parquet dataset, since it requires reading
the footers of all existing files and merging them. This is especially
frustrating when appending a small amount of data to an existing large dataset.
2. Parquet doesn't always write the summary files even if you explicitly
set `parquet.enable.summary-metadata` to true. For example, when two files have
different values for the same key in the user-defined key/value metadata
section, Parquet simply gives up writing the summary files and deletes existing
ones. This may be quite common in the case of schema evolution. What makes it
worse, an outdated `_common_metadata` file might not be deleted properly due to
PARQUET-359, which leaves the summary files out of sync. If the summary files
aren't needed, they can be disabled entirely; see the sketch after this list.
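For what it's worth, here is a minimal sketch of opting out of summary files on
the write path. The paths and app name are hypothetical; `parquet.enable.summary-metadata`
is the same Hadoop configuration key mentioned above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("append-example").getOrCreate()

// parquet-hadoop reads this Hadoop conf key; setting it to "false" skips
// writing _metadata/_common_metadata entirely, so an append never has to
// read and merge the footers of the existing files.
spark.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")

// Hypothetical paths: append a small batch to a large existing dataset.
spark.read.parquet("/data/new-batch")
  .write.mode("append").parquet("/data/large-dataset")
```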
That said, I agree that with an existing trustworthy `_metadata` file
at hand, this patch is still very useful. I'll take a deeper look at this
tomorrow.