Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14649
Glad that helped, and sorry it wasn't clearer. Agreed that writing
summary metadata isn't always the best choice. This patch only performs
file pruning if the _metadata file exists for the dataset. At work we have
summary metadata enabled, since we have a query-heavy workload where new
data lands occasionally.
On Tue, Sep 27, 2016 at 10:13 AM -0700, "Cheng Lian"
<[email protected]> wrote:
@andreweduffy Thanks for the explanations! This makes much
more sense to me now.
Although _metadata can be neat for the read path, it's a troublemaker for
the write path:
- Writing summary files (either _metadata or _common_metadata) can be quite
  expensive when writing a large Parquet dataset, since it reads the footers
  of all files and tries to merge them. This can be especially frustrating
  when appending a small amount of data to an existing large dataset.
- Parquet doesn't always write the summary files, even if you explicitly set
  parquet.enable.summary-metadata to true. For example, when two files have
  different values for the same key in the user-defined key/value metadata
  section, Parquet simply gives up writing the summary files and deletes the
  existing ones. This may be quite common in the case of schema evolution.
  To make matters worse, an outdated _common_metadata might not be deleted
  properly due to PARQUET-359, which leaves the summary files out of sync.
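To make the second point concrete, here is a minimal sketch of the "give up on conflict" behavior described above. It is a simplified model in plain Python, not Parquet's actual implementation: the function name merge_key_value_metadata and the dict-per-footer representation are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional


def merge_key_value_metadata(
    footers: List[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Merge the user-defined key/value metadata of several file footers.

    Hypothetical model: each footer is represented as a plain dict.
    Returns the merged dict, or None when two footers disagree on the
    value of the same key (the case where no summary file is written).
    """
    merged: Dict[str, str] = {}
    for footer in footers:
        for key, value in footer.items():
            if key in merged and merged[key] != value:
                # Conflicting values for one key: give up on the summary.
                return None
            merged[key] = value
    return merged


# Footers that agree on shared keys merge cleanly.
compatible = [{"writer": "a"}, {"writer": "a", "note": "x"}]
# Footers that differ on one key (e.g. after schema evolution) do not.
conflicting = [{"schema": "v1"}, {"schema": "v2"}]

print(merge_key_value_metadata(compatible))
print(merge_key_value_metadata(conflicting))
```

Note that every footer has to be visited before a merged result can be produced, which is also why the first point (cost on large datasets) applies even when only a few files were appended.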
However, I still agree that with an existing trustworthy _metadata file at
hand, this patch is still very useful. I'll take a deeper look at this tomorrow.