Github user nsyca commented on the issue:
https://github.com/apache/spark/pull/14912
Thanks, @gatorsmile, for mentioning me. I will try my best to comment on
this thread. Disclaimer: I have not looked at the existing code manipulating
predicates/expressions in Spark. Nor have I the code in this PR. I am writing
my comment here based solely on the comments I read in this PR.
One of the goals of predicate transformation, in general, is to aid the
predicate pushdown. If a new form of a predicate, or a derived form of a
superset of a predicate is to be generated, it should be because there is a
potential the new form or the derived form can be pushed down further the plan.
Another goal of the transformation is because the new form has a potential
to be simplified further.
Taking the example of ``(a > 10 || b > 2) && (a > 10 || c == 3)``, I don't
see any benefit of transforming to ``(a > 10) || (b > 2 && c == 3)`` as it will
form a disjunctive predicate. If only ``b == c`` by transitivity rule then we
may want to do that in order to simplify further to ``(a > 10 || c == 3``
(because ``b == c`` and ``c > 2 && c == 3`` can be reduced to ``c == 3``.
The most benefit in the topic of predicate transformation is the equality
transitivity property as equality predicates are commonly used in SQL queries.
I remember there were a few JIRAs opened, but deferred, to solve this problem.
There are some capability in the current version to propagate the equality
transitivity but the behaviour is not consistent.
Predicate transformation like extracting common subterms. An example is the
predicate ``(a=1 || b=2) && (a=1 || c=4)`` and a is a column from a different
stream of columns b and c should be transformed to ``a=1 && (b=2 || c=4)``. A
more complex case is the predicate ``(a=1 || b=2) && (a=3 || c=4)`` should have
a new predicate ``(a=1 || a=3)`` added as a superset predicate to early filter
the stream of a to just the two values needed.
Introducing superset, redundant predicates like the last example above will
complicate the computation of filter ratios of the predicates on a given stream
when we introduce the Cost-based Optimization, which I assume depends on a good
estimate of filter ratios on a given stream. This is because we cannot make
assumption on the independent filtering affects among a set of predicates. Here
the filter ratio of the newly generated superset predicate should be ignored in
the filtering estimate.
Another goal of predicate transformation is to derive contradiction and/or
tautology. This is achieved by building the inequality relationships among the
same column of a set of predicate. A simple example is ``a>1 && a < 1`` should
be evaluated to ``false`` at the compile time and eliminate the scan of the
stream completely. The stream is treated like producing an empty set. Depending
on the context, the stream may be substituted by a NULL row when it is a
subquery in an existential (EXISTS) or a universal (ALL) subquery, or a
singleton NULL value when it is a scalar subquery.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]