Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14750

@clockfly I can offer one answer to your question. One of the main benefits of this change is that it allows us to remove the costly schema reconciliation between the Hive metastore schema and the on-disk Parquet file schema. Currently this reconciliation must be performed for Parquet files whose schema includes a column with an uppercase letter. Since we don't know _a priori_ whether a given Parquet file meets that criterion, we perform this reconciliation for every Parquet-based Hive table. Further, for partitioned tables, every partition's Parquet schema is read from disk and the schemas are merged together. For tables with large numbers of partitions this is a very costly operation.

I've submitted a tentative PR (#14690) for performing partition pruning of metastore tables by querying the metastore for partition metadata rather than reading all of the partitions from disk. Of course, if we still read all of the Parquet file schemas from disk to do reconciliation with the metastore, we lose the benefits of this major optimization. So in #14690 we omit metastore reconciliation entirely. This breaks compatibility with tables whose Parquet files include fields with uppercase letters. We don't want that. :) As it stands, this is a blocker for merging #14690.

This PR seems to fix that scenario by returning the case-sensitive table schema from the external catalog. Thus I'd love to see this get merged in.

If we can't get the case-sensitive table schema from the catalog, then we need to find an alternative solution to make #14690 viable. One compromise solution is to still perform the metastore schema reconciliation, but restrict it solely to those partitions which pass the partition pruning filters. While less efficient than a single metastore query, it would be a significant improvement over the status quo.

I hope that helps.
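
To make the compromise above concrete, here is a minimal sketch in plain Python. It is not Spark's implementation. Schemas are modeled as lists of `(name, type)` pairs, and every name in it (`reconcile`, `merge_schemas`, `reconcile_pruned`, `read_footer`) is hypothetical, chosen for illustration only. It assumes the metastore returns lower-cased field names while the Parquet files preserve case, and it shows reconciliation restricted to the partitions that survive pruning.

```python
# Illustrative sketch only: all names here are hypothetical, not Spark APIs.
# A schema is modeled as a list of (field_name, type_name) pairs.


def reconcile(metastore_schema, parquet_schema):
    """Restore the Parquet field-name casing onto the metastore schema.

    Fields are matched case-insensitively. Fields that do not appear in the
    Parquet schema (for example partition columns) keep the metastore name.
    """
    by_lower = {name.lower(): name for name, _ in parquet_schema}
    return [
        (by_lower.get(name.lower(), name), dtype)
        for name, dtype in metastore_schema
    ]


def merge_schemas(schemas):
    """Union the fields of several partition schemas; first occurrence wins."""
    merged = {}
    for schema in schemas:
        for name, dtype in schema:
            merged.setdefault(name, dtype)
    return list(merged.items())


def reconcile_pruned(metastore_schema, partitions, predicate, read_footer):
    """Reconcile using only the partitions that pass the pruning predicate.

    `read_footer` stands in for the expensive step of reading a partition's
    Parquet schema from disk. It is called once per selected partition
    instead of once per partition in the table.
    """
    selected = [p for p in partitions if predicate(p)]
    merged = merge_schemas(read_footer(p) for p in selected)
    return reconcile(metastore_schema, merged), len(selected)
```

For a table with three `ds` partitions and a filter that matches one of them, `reconcile_pruned` calls `read_footer` once rather than three times. The cost still grows with the number of selected partitions, which is why a single metastore query returning the case-sensitive schema remains the preferable option.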