Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/7238#issuecomment-126017537
Hey @viirya, sorry that I lied, this might not be the FINAL check yet...
Actually I began to regret one of my previous decisions, namely merging
part-files which don't have corresponding summary files. This is mostly
because there are too many cases to consider if we assume summary files may be
missing, and this makes the behavior of this configuration pretty
unintuitive: sometimes no part-files are merged, while sometimes some
part-files get merged. Parquet summary files can be missing under various
corner cases, so the behavior is hard to explain and may confuse Spark users.
The key problem here is that Parquet summary files are not written/accessed in
an atomic manner. And that's one of the most important reasons why the Parquet
team is actually trying to get rid of the summary file entirely.
Since the configuration is named "respectSummaryFiles", it seems more
natural and intuitive to assume that summary files are ALWAYS properly
generated for ALL Parquet write jobs when this configuration is turned on. To
be more specific, given one or more Parquet input paths, we may find one or
more summary files. The metadata gathered by merging all these summary files
should reflect the real schema of the given Parquet dataset. Only in this
case can we really "respect" existing summary files.
So my suggestion here is that, when the "respectSummaryFiles" configuration
is turned on, we only collect all summary files, merge the schemas read from
them, and just use the merged schema as the final result schema. And of course, this
configuration should still be turned off by default. We can document this
configuration with an "expert only" tag.
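To illustrate, the summary-only path could look something like the sketch
below. This is just a sketch against parquet-hadoop as I remember the API;
`schemaFromSummaryFiles` is a made-up helper name, not existing Spark code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.schema.MessageType

// Hypothetical helper: derive the result schema from summary files ONLY.
// If no summary file exists under any input path, we give up (return None)
// instead of falling back to reading part-file footers.
def schemaFromSummaryFiles(conf: Configuration, inputPaths: Seq[Path]): Option[MessageType] = {
  val summaryFiles = inputPaths.flatMap { dir =>
    val fs = dir.getFileSystem(conf)
    Seq(ParquetFileWriter.PARQUET_METADATA_FILE,         // "_metadata"
        ParquetFileWriter.PARQUET_COMMON_METADATA_FILE)  // "_common_metadata"
      .map(new Path(dir, _))
      .filter(fs.exists)
  }

  // Read the footer of each summary file and union all schemas pairwise.
  summaryFiles
    .map(path => ParquetFileReader.readFooter(conf, path).getFileMetaData.getSchema)
    .reduceOption(_ union _)
}
```

Since we never fall back to part-file footers here, the semantics stay
simple: either the summary files fully describe the dataset, or we get `None`
and can raise a proper error.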
I still consider this configuration quite useful, because even if you have a
dirty Parquet dataset without summary files or with incorrect summary files at
hand, you can still repair the summary files quite easily. Essentially you
only need to call `ParquetOutputFormat.writeMetaDataFile`, which either
generates correct summary files for the entire dataset or deletes broken
summary files if it fails to merge all user-defined key-value metadata.
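For example, repairing a dataset could be as simple as the following. This
assumes the static `writeMetaDataFile(Configuration, Path)` helper that
parquet-hadoop exposes on `ParquetOutputCommitter` (the exact owning class may
differ across Parquet versions, and the path here is illustrative only):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetOutputCommitter

// Rewrite the summary files for an existing (possibly dirty) dataset.
// On success this produces fresh _metadata/_common_metadata files; if
// merging user-defined key-value metadata fails, stale summary files
// are deleted rather than left behind.
val conf = new Configuration()
ParquetOutputCommitter.writeMetaDataFile(conf, new Path("hdfs:///tmp/parquet-dataset"))
```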
What do you think? Again, sorry for the late review and the extra effort of
implementing all those intermediate versions...