Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/14619
@cloud-fan I've moved the `InsertRelationScanner` rule to `Analyzer`, after
relations and expressions are resolved. To reuse the existing analyzer and
optimizer rules, I updated the related rules, such as `CleanupAliases`,
`ColumnPruning`, `PushDownPredicate`, `InferFiltersFromConstraints`,
`ConvertToLocalRelation`, and `PropagateEmptyRelation`. I also added new rules
to combine and prune `Scanner` operators. Besides, I made some changes to the
subquery-related rules, and recently found that they have been refactored.
Now only a few test cases are still failing, and they should be easy to fix.
However, I have realized that adding a wrapper node over every relation may not
be an ideal approach, for the following reasons:
First, scanning a relation is not among the basic operators of the SQL
language. When we declare a relation, we imply that it should be scanned, so it
seems semantically redundant to declare a `Scanner` node over a relation or to
call `relation.scanner()`. Besides, to add this wrapper node we have to make a
new assumption that no other operator is inserted between a `Scanner` and its
corresponding relation, which brings in more complexity.
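To make that extra assumption concrete, here is a minimal sketch of the invariant every rule would have to preserve. The `Plan`, `Relation`, `Filter`, and `Scanner` types and the `scannersWrapRelations` helper are hypothetical stand-ins for illustration, not the Catalyst classes or the code in this PR.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst classes.
object ScannerInvariantSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan
  case class Scanner(child: Plan) extends Plan

  // True when every Scanner in the plan sits directly on top of a Relation.
  def scannersWrapRelations(plan: Plan): Boolean = plan match {
    case Scanner(_: Relation) => true
    case Scanner(_)           => false // some operator slipped in between
    case Filter(_, child)     => scannersWrapRelations(child)
    case Relation(_)          => true
  }
}
```

In this toy model, a rule that pushes a `Filter` below a `Scanner` would break the invariant, so each rule touching the plan would need to be aware of it.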
Second, a wrapper node should contain the output, the predicates that can be
used in partition pruning, and the relation to be scanned. But this complicates
some cases. For example, in `InferFiltersFromConstraints` we have to convert
the expressions in filters to alias names when we collect valid constraints,
because the output may consist of aliases while the filters have to use the
child expressions. This behavior is not needed in other operators.
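As a rough illustration of that alias rewriting, the sketch below uses a toy expression model. All of the types here (`Expr`, `Attr`, `Alias`, `Scanner`, and so on) and the `toOutputNames` and `constraints` helpers are hypothetical and only loosely mirror Catalyst; they are not the code in this PR.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst classes.
object ScannerAliasSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: Int) extends Expr
  case class GreaterThan(left: Expr, right: Expr) extends Expr
  case class Alias(child: Expr, name: String)

  // Hypothetical wrapper node: output aliases, pruning filters, relation name.
  case class Scanner(output: Seq[Alias], filters: Seq[Expr], relation: String)

  // Replace any sub-expression that equals an alias's child with the alias name.
  def toOutputNames(e: Expr, aliases: Seq[Alias]): Expr =
    aliases.find(_.child == e) match {
      case Some(a) => Attr(a.name)
      case None =>
        e match {
          case GreaterThan(l, r) =>
            GreaterThan(toOutputNames(l, aliases), toOutputNames(r, aliases))
          case other => other
        }
    }

  // Filters refer to child expressions, but constraints seen by parent
  // operators must be phrased in terms of the output names.
  def constraints(s: Scanner): Seq[Expr] =
    s.filters.map(toOutputNames(_, s.output))
}
```

With an output of `a AS x` and a filter `a > 1`, the constraint exposed to parent operators in this model becomes `x > 1`, which is the extra step other operators do not need.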
Finally, I feel that adding such an operator has caused too many changes.
Perhaps we should make some improvements to `PhysicalOperation` until we figure
out an approach that is comprehensively better than the current method.
All in all, I'm passionate about this improvement and will try my best to
contribute. Please correct me if I'm wrong. Thank you!