Github user mallman commented on the issue:
https://github.com/apache/spark/pull/14690
I've been thinking some more about the metastore/file schema
reconciliation. As I mentioned in the PR description, this patch omits it.
That omission causes failures when the Parquet files have columns with
upper-case characters, among other cases.
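
To illustrate the upper-case failure mode, here is a minimal sketch of case-insensitive reconciliation. It assumes the metastore reports lower-cased column names while the Parquet file keeps the original spelling. The object `SchemaReconciliationSketch` and the helper `reconcile` are hypothetical and model a schema as a plain list of names. They are not Spark's API.

```scala
// Toy model, not Spark's API: a schema is just an ordered list of column names.
object SchemaReconciliationSketch {

  // Hypothetical helper: for each metastore column, prefer the spelling found in
  // the file schema when one matches case-insensitively; otherwise keep the
  // metastore spelling unchanged.
  def reconcile(metastoreCols: Seq[String], fileCols: Seq[String]): Seq[String] = {
    val fileByLowerName = fileCols.map(c => c.toLowerCase -> c).toMap
    metastoreCols.map(c => fileByLowerName.getOrElse(c.toLowerCase, c))
  }

  def main(args: Array[String]): Unit = {
    val metastoreCols = Seq("userid", "eventtime") // assumed lower-cased by the metastore
    val fileCols = Seq("userId", "eventTime")      // assumed spelling in the Parquet file

    // A case-sensitive lookup by metastore name misses the file column.
    println(fileCols.contains("userid"))
    // After reconciliation the file's spelling is used.
    println(reconcile(metastoreCols, fileCols))
  }
}
```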
I mentioned trying to add a new schema method to the relation or its file
catalog which would read the file schema for only a subset of the partitions.
However, I quickly determined that would require a fundamental change in the
way `BaseRelation` instances are handled throughout the codebase and decided
that was a nonstarter.
I have another idea I'd like your thoughts on. Rather than defer partition
pruning to physical planning, what if we did that in the optimization phase and
performed the Hive `MetastoreRelation` conversion in an optimization rule? This
rule would build a `HadoopFsRelation` based on the partition pruning
predicates. We could add it to a batch after all of the pruning predicates have
been pushed down and set it to run once. In that kind of scenario, I believe we
could leave the `HadoopFsRelation` and `FileCatalog` types as they are in the
current codebase. We'd just be building a `FileCatalog` restricted to the
partitions we identify in the optimization phase. Physical planning would
proceed as usual.
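
The shape of such a rule can be sketched with toy types. Everything below is hypothetical and stands in for the real classes: `MetastoreTable` for `MetastoreRelation`, `FsRelation` for `HadoopFsRelation`, `RestrictedCatalog` for a `FileCatalog` limited to selected partitions. It assumes the partition predicate has already been pushed down to sit directly above the table, and it is not how Spark's optimizer is actually wired.

```scala
// Toy stand-ins for illustration only; none of these are Spark classes.
object PruneThenConvertSketch {
  final case class Partition(values: Map[String, String], path: String)
  final case class RestrictedCatalog(partitions: Seq[Partition])

  sealed trait Plan
  final case class MetastoreTable(name: String, partitions: Seq[Partition]) extends Plan
  final case class FsRelation(catalog: RestrictedCatalog) extends Plan
  final case class Filter(partitionPredicate: Map[String, String] => Boolean, child: Plan)
      extends Plan

  // Hypothetical "run once" rule: when a partition filter sits directly on a
  // metastore table, replace the table with a file-based relation whose catalog
  // lists only the partitions that satisfy the filter.
  def convert(plan: Plan): Plan = plan match {
    case Filter(pred, MetastoreTable(_, parts)) =>
      Filter(pred, FsRelation(RestrictedCatalog(parts.filter(p => pred(p.values)))))
    case Filter(pred, child) =>
      Filter(pred, convert(child))
    case other =>
      other
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Partition(Map("ds" -> "2016-08-01"), "/t/ds=2016-08-01"),
      Partition(Map("ds" -> "2016-08-02"), "/t/ds=2016-08-02"))
    val plan = Filter(_.get("ds").contains("2016-08-02"), MetastoreTable("t", parts))
    println(convert(plan))
  }
}
```

In this sketch the filter is kept above the converted relation, so later phases still see the predicate. Only the set of partitions the relation knows about has shrunk.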
In this approach, we should be able to do "just in time" schema
reconciliation/merging based only on the selected partitions. *And* we could
address an issue I've long been frustrated by: the fact that a
`BaseRelation`'s size is defined regardless of the partitions we're using. By
creating a `HadoopFsRelation` restricted to the partitions we care about, we
can compute a precise size for the relation tailored to our query. This would
greatly help the automatic broadcast join conversion heuristic.
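
A small sketch of why the restricted size matters for the broadcast heuristic. The types and the partition sizes are made up for illustration. The 10 MB threshold mirrors the default of `spark.sql.autoBroadcastJoinThreshold`, and the comparison is a simplification of the real check.

```scala
// Toy model for illustration only; not Spark's statistics API.
object RelationSizeSketch {
  final case class Partition(path: String, sizeInBytes: Long)

  // Assumed threshold: the default of spark.sql.autoBroadcastJoinThreshold (10 MB).
  val broadcastThresholdBytes: Long = 10L * 1024 * 1024

  // Size of a relation restricted to the given partitions.
  def sizeInBytes(parts: Seq[Partition]): Long = parts.map(_.sizeInBytes).sum

  // Simplified heuristic: broadcast when the relation fits under the threshold.
  def canBroadcast(parts: Seq[Partition]): Boolean =
    sizeInBytes(parts) <= broadcastThresholdBytes

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    // Hypothetical table: 30 daily partitions of 4 MB each.
    val all = (1 to 30).map(d => Partition(f"/t/ds=2016-08-$d%02d", 4 * mb))
    // The query only touches the first two days.
    val selected = all.take(2)

    println(canBroadcast(all))      // sized on the whole table (120 MB)
    println(canBroadcast(selected)) // sized on the selected partitions (8 MB)
  }
}
```

With the whole-table size the join side looks too large to broadcast. With the size of only the selected partitions it falls under the threshold.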
Thoughts? I may start fiddling around with this when I have time today.
It's possible I'm missing a crucial detail that makes this approach impractical.