Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/14690
  
    I've been thinking some more about the metastore/file schema 
reconciliation. As I mentioned in the PR description, this patch omits this 
reconciliation. This causes failures when the parquet files have columns with 
upper case characters, among other cases.
    
    I mentioned trying to add a new schema method to the relation or its file 
catalog which would read the file schema for only a subset of the partitions. 
However, I quickly determined that would require a fundamental change in the 
way `BaseRelation` instances are handled throughout the codebase and decided 
that was a nonstarter.
    
    I have another idea I'd like your thoughts on. Rather than defer partition 
pruning to physical planning, what if we did that in the optimization phase and 
performed the hive `MetastoreRelation` conversion in an optimization rule? This 
rule would build a `HadoopFsRelation` based on the partition pruning 
predicates. We could add it to a batch after all of the pruning predicates have 
been pushed down and set it to run once. In that kind of scenario, I believe we 
could leave the `HadoopFsRelation` and `FileCatalog` types as they are in the 
current codebase. We'd just be building a `FileCatalog` restricted to the 
partitions we identify in the optimization phase. Physical planning would 
proceed as usual.
    
    In this approach, we should be able to do "just in time" schema 
reconciliation/merging based only on the selected partitions. *And* we could 
address an issue I've long been frustrated by—the fact that a 
`BaseRelation`'s size is defined irregardless of the partitions we're using. By 
creating a `HadoopFsRelation` restricted to the partitions we care about, we 
can compute a precise size for the relation tailored to our query. This would 
greatly help the automatic broadcast join conversion heuristic.
    
    Thoughts? I may start fiddling around with this when I have time today. 
It's possible I'm missing a crucial detail that makes this approach impractical.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to