Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/14750
  
    @clockfly I can offer one answer to your question. One of the main benefits 
of this change is to allow us to remove the costly schema reconciliation 
between hive metastore schema and on-disk parquet file schema. Currently this 
reconciliation must be performed for parquet files whose schema includes a 
column with an uppercase letter. Since we don't know _a priori_ if a given 
parquet file meets that criterion, we perform this reconciliation for every 
parquet-based hive table. Further, for partitioned tables, every partition's 
parquet schema is read from disk and merged together. For tables with large 
numbers of partitions this is a very costly operation.
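
    To make the cost concrete, here is a rough sketch of what the reconciliation amounts to. It is plain Python rather than Spark code, it assumes the metastore hands back lowercased column names while the parquet footer preserves case, and `reconcile_schema` is a made-up name for illustration, not an actual Spark function.

```python
# Hypothetical sketch: names are illustrative and are not Spark's API.
# Assumption: the metastore stores column names lowercased, while the
# parquet file footer keeps the original casing.

def reconcile_schema(metastore_fields, parquet_fields):
    """Return the metastore columns with casing restored from the parquet file.

    metastore_fields: column names as reported by the metastore (lowercase).
    parquet_fields: column names as read from a parquet footer (case-preserving).
    """
    by_lower = {}
    for name in parquet_fields:
        key = name.lower()
        if key in by_lower:
            # Two parquet columns differ only by case, so the match is ambiguous.
            raise ValueError(f"ambiguous column after lowercasing: {key}")
        by_lower[key] = name
    # Keep the metastore's column order; fall back to the metastore name
    # when the file does not contain the column (e.g. a partition column).
    return [by_lower.get(name.lower(), name) for name in metastore_fields]
```

    The lookup itself is cheap. The expensive part is that `parquet_fields` has to come from reading a footer on disk, and today that happens for every partition of the table.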
    
    I've submitted a tentative PR (#14690) for performing partition pruning of 
metastore tables by querying the metastore for partition metadata rather than 
reading all of the partitions from disk. Of course, if we still read all of the 
parquet file schemas from disk to reconcile them with the metastore, we lose 
the benefits of this major optimization. So in #14690 we omit metastore 
reconciliation entirely. This breaks compatibility with tables whose parquet 
files include fields with uppercase letters. We don't want that. :) As it 
stands, this is a blocker for merging #14690.
    
    This PR seems to fix that scenario by returning the case-sensitive table 
schema from the external catalog. Thus I'd love to see this get merged in.
    
    If we can't get the case-sensitive table schema from the catalog, then we 
need to find an alternative solution to make #14690 viable. One compromise 
solution is to still perform the metastore schema reconciliation, but restrict 
it solely to those partitions which pass the partition pruning filters. While 
less efficient than a single metastore query, it would be a significant 
improvement over the status quo.
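
    A minimal sketch of that compromise, again in plain Python with made-up names (`schema_for_query`, `read_footer`, and the partition dict layout are all assumptions for illustration, not anything in #14690): prune first, then read and merge parquet schemas only for the partitions that survive.

```python
# Hypothetical sketch: the helpers and data layout are illustrative only.

def merge_schemas(schemas):
    """Union the column names of several schemas, keeping first-seen order."""
    merged, seen = [], set()
    for schema in schemas:
        for name in schema:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged

def schema_for_query(partitions, predicate, read_footer):
    """Merge parquet schemas only for partitions that pass the pruning filter.

    partitions: list of {"path": str, "values": dict} entries, assumed to come
        from a metastore query rather than a directory listing.
    predicate: function over a partition's values, standing in for the
        partition pruning filters.
    read_footer: function that reads one partition's parquet schema from disk;
        this is the expensive call we want to make as rarely as possible.
    """
    selected = [p for p in partitions if predicate(p["values"])]
    return merge_schemas(read_footer(p["path"]) for p in selected)
```

    With this shape the number of footer reads scales with the partitions a query actually touches rather than with the size of the table.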
    
    I hope that helps.

