[GitHub] spark issue #14912: [SPARK-17357][SQL] Fix current predicate pushdown

nsyca Mon, 12 Sep 2016 08:17:49 -0700

Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/14912
  
    Thanks, @gatorsmile, for mentioning me. I will try my best to comment on 
this thread. Disclaimer: I have not looked at the existing code manipulating 
predicates/expressions in Spark. Nor have I the code in this PR. I am writing 
my comment here based solely on the comments I read in this PR.
    
    One of the goals of predicate transformation, in general, is to aid the 
predicate pushdown. If a new form of a predicate, or a derived form of a 
superset of a predicate is to be generated, it should be because there is a 
potential the new form or the derived form can be pushed down further the plan.
    
    Another goal of the transformation is because the new form has a potential 
to be simplified further.
    
    Taking the example of ``(a > 10 || b > 2) && (a > 10 || c == 3)``, I don't 
see any benefit of transforming to ``(a > 10) || (b > 2 && c == 3)`` as it will 
form a disjunctive predicate. If only ``b == c`` by transitivity rule then we 
may want to do that in order to simplify further to ``(a > 10 || c == 3`` 
(because ``b == c`` and ``c > 2 && c == 3`` can be reduced to ``c == 3``.
    
    The most benefit in the topic of predicate transformation is the equality 
transitivity property as equality predicates are commonly used in SQL queries. 
I remember there were a few JIRAs opened, but deferred, to solve this problem. 
There are some capability in the current version to propagate the equality 
transitivity but the behaviour is not consistent.
    
    Predicate transformation like extracting common subterms. An example is the 
predicate ``(a=1 || b=2) && (a=1 || c=4)`` and a is a column from a different 
stream of columns b and c should be transformed to ``a=1 && (b=2 || c=4)``. A 
more complex case is the predicate ``(a=1 || b=2) && (a=3 || c=4)`` should have 
a new predicate ``(a=1 || a=3)`` added as a superset predicate to early filter 
the stream of a to just the two values needed.
    
    Introducing superset, redundant predicates like the last example above will 
complicate the computation of filter ratios of the predicates on a given stream 
when we introduce the Cost-based Optimization, which I assume depends on a good 
estimate of filter ratios on a given stream. This is because we cannot make 
assumption on the independent filtering affects among a set of predicates. Here 
the filter ratio of the newly generated superset predicate should be ignored in 
the filtering estimate.
    
    Another goal of predicate transformation is to derive contradiction and/or 
tautology. This is achieved by building the inequality relationships among the 
same column of a set of predicate. A simple example is ``a>1 && a < 1`` should 
be evaluated to ``false`` at the compile time and eliminate the scan of the 
stream completely. The stream is treated like producing an empty set. Depending 
on the context, the stream may be substituted by a NULL row when it is a 
subquery in an existential (EXISTS) or a universal (ALL) subquery, or a 
singleton NULL value when it is a scalar subquery.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #14912: [SPARK-17357][SQL] Fix current predicate pushdown

Reply via email to