Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/14649
@andreweduffy Thanks for the explanations! This makes much
more sense to me now.
Although `_metadata` can be neat for the read path, it's a troublemaker
for the write path:
1. Writing summary files (either `_metadata` or `_common_metadata`) can be
quite expensive when writing a large Parquet dataset, since it requires reading
the footers of all existing files and merging them. This is especially
frustrating when appending a small amount of data to an existing large dataset.
2. Parquet doesn't always write the summary files even if you explicitly
set `parquet.enable.summary-metadata` to true. For example, when two files have
different values for the same key in the user-defined key/value metadata
section, Parquet simply gives up writing the summary files and deletes existing
ones. This may be quite common in the case of schema evolution. What makes it
worse, an outdated `_common_metadata` file might not be deleted properly due to
PARQUET-359, which leaves the summary files out of sync. If the summary files
aren't needed, they can be disabled entirely; see the sketch after this list.
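For what it's worth, here is a minimal sketch of opting out of summary files on
the write path. The paths and app name are hypothetical; `parquet.enable.summary-metadata`
is the same Hadoop configuration key mentioned above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("append-example").getOrCreate()

// parquet-hadoop reads this Hadoop conf key; setting it to "false" skips
// writing _metadata/_common_metadata entirely, so an append never has to
// read and merge the footers of the existing files.
spark.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")

// Hypothetical paths: append a small batch to a large existing dataset.
spark.read.parquet("/data/new-batch")
  .write.mode("append").parquet("/data/large-dataset")
```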
That said, I agree that with an existing trustworthy `_metadata` file
at hand, this patch is still very useful. I'll take a deeper look at this
tomorrow.