GitHub user mallman commented on the issue:
https://github.com/apache/spark/pull/14750
@clockfly I can offer one answer to your question. One of the main benefits
of this change is to allow us to remove the costly schema reconciliation
between hive metastore schema and on-disk parquet file schema. Currently this
reconciliation must be performed for parquet files whose schema includes a
column name with an uppercase letter. Since we don't know _a priori_ if a given
parquet file meets that criterion, we perform this reconciliation for every
parquet-based hive table. Further, for partitioned tables, every partition's
parquet schema is read from disk and merged together. For tables with large
numbers of partitions this is a very costly operation.
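
To make the cost concrete, here is a rough sketch of the reconciliation described above. It is plain Python, not Spark's actual code: the function names and the (name, type) tuple representation are made up for illustration, and it assumes the metastore hands back lowercased column names while the parquet files keep the original casing.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a schema is a list of (column name, type name).
Field = Tuple[str, str]


def merge_partition_schemas(partition_schemas: List[List[Field]]) -> List[Field]:
    """Union the fields of every partition's file schema, in first-seen order.

    In the real system each element would come from reading parquet files
    on disk, which is why the cost grows with the number of partitions.
    """
    merged: Dict[str, str] = {}
    for schema in partition_schemas:
        for name, type_name in schema:
            merged.setdefault(name, type_name)
    return list(merged.items())


def reconcile(metastore_schema: List[Field], parquet_schema: List[Field]) -> List[Field]:
    """Restore the on-disk casing of column names in the metastore schema."""
    cased = {name.lower(): name for name, _ in parquet_schema}
    return [
        (cased.get(name.lower(), name), type_name)
        for name, type_name in metastore_schema
    ]


if __name__ == "__main__":
    metastore = [("userid", "bigint"), ("ds", "string")]
    partitions = [
        [("userId", "bigint")],
        [("userId", "bigint"), ("eventName", "string")],
    ]
    print(reconcile(metastore, merge_partition_schemas(partitions)))
```

Note that `merge_partition_schemas` has to visit every partition before `reconcile` can run at all, regardless of how many partitions the query ends up needing.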
I've submitted a tentative PR (#14690) for performing partition pruning of
metastore tables by querying the metastore for partition metadata rather than
reading all of the partitions from disk. Of course, if we still read all of the
parquet file schemas from disk to reconcile them with the metastore, we lose
the benefits of this major optimization. So in #14690 we omit metastore
reconciliation entirely. This breaks compatibility with tables whose parquet
files include fields with uppercase letters. We don't want that. :) As it
stands, this is a blocker for merging #14690.
This PR seems to fix that scenario by returning the case-sensitive table
schema from the external catalog. Thus I'd love to see this get merged in.
If we can't get the case-sensitive table schema from the catalog, then we
need to find an alternative solution to make #14690 viable. One compromise
solution is to still perform the metastore schema reconciliation, but restrict
it solely to those partitions which pass the partition pruning filters. While
less efficient than a single metastore query, it would still be a significant
improvement over the status quo.
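
A minimal sketch of that compromise, again in plain Python rather than Spark code. The names here (`schemas_for_pruned_partitions`, `read_footer_schema`, the dict-based partition spec) are hypothetical stand-ins; the only point is the ordering: apply the pruning predicate to partition metadata first, then read file schemas only for the partitions that survive.

```python
from typing import Callable, Dict, List, Tuple

Field = Tuple[str, str]          # hypothetical (column name, type name)
PartitionSpec = Dict[str, str]   # hypothetical, e.g. {"ds": "2016-08-01"}


def schemas_for_pruned_partitions(
    partitions: List[PartitionSpec],
    predicate: Callable[[PartitionSpec], bool],
    read_footer_schema: Callable[[PartitionSpec], List[Field]],
) -> List[List[Field]]:
    """Read on-disk schemas only for partitions that pass the pruning filter."""
    selected = [p for p in partitions if predicate(p)]
    # One (expensive) schema read per surviving partition, none for the rest.
    return [read_footer_schema(p) for p in selected]


if __name__ == "__main__":
    reads: List[str] = []

    def fake_reader(partition: PartitionSpec) -> List[Field]:
        # Stand-in for reading a parquet footer; records which partitions were touched.
        reads.append(partition["ds"])
        return [("userId", "bigint")]

    all_partitions = [{"ds": f"2016-08-{day:02d}"} for day in range(1, 31)]
    schemas = schemas_for_pruned_partitions(
        all_partitions, lambda p: p["ds"] == "2016-08-15", fake_reader
    )
    print(len(schemas), reads)
```

The schemas returned here would still need to be merged and reconciled against the metastore schema, but only over the pruned set instead of the whole table.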
I hope that helps.