Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21230
Hi @rdblue, thanks for your new approach! As you said, the major problem
is statistics. This is unfortunately a flaw in Spark's CBO design: statistics
should belong to the physical node, but they currently belong to the logical
node.
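To make that concrete, here is a minimal sketch (against Spark 2.x-era APIs) of where the estimate lives: the size that drives the broadcast-vs-shuffle decision is read off the optimized *logical* plan, so any correction that only happens at the physical level arrives too late.

```scala
import org.apache.spark.sql.SparkSession

object StatsInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("stats").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    // Statistics hang off the optimized *logical* plan. During physical
    // planning, join selection compares this sizeInBytes against
    // spark.sql.autoBroadcastJoinThreshold to decide broadcast vs shuffle.
    val stats = df.queryExecution.optimizedPlan.stats
    println(s"estimated size: ${stats.sizeInBytes} bytes")

    spark.stop()
  }
}
```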
For file-based data sources, since they are built-in, we can create
rules that update statistics during the logical phase, e.g. `PruneFileSourcePartitions`.
But for external sources like Iceberg, we have no way to update
statistics before planning, so a shuffle join may be wrongly planned when a
broadcast join is applicable. In other words, users may need to write custom
optimizer rules to make their data sources work well.
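For illustration, here is a hedged sketch of the kind of rule a source author would have to write and ship themselves. `UpdateExternalSourceStats` and its match arm are placeholders, but `spark.experimental.extraOptimizations` is the real end-user registration hook available today:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder custom rule: a real implementation would recognize the
// source's relation node and replace it with a copy whose Statistics
// reflect the pushed-down filters and pruned partitions.
object UpdateExternalSourceStats extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case relation => relation // source-specific stats adjustment goes here
  }
}

object RegisterRule {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Register the extra logical rule so it runs with the optimizer:
    spark.experimental.extraOptimizations =
      spark.experimental.extraOptimizations :+ UpdateExternalSourceStats
    spark.stop()
  }
}
```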
That said, I do like your approach if we can fix the statistics problem
first. I'm not sure how hard that is or how soon it can be fixed, cc @wzhfy
Until then, I'd like to keep the pushdown logic in the optimizer and
leave the hard work to Spark instead of users. What do you think?