[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-33152:
-------------------------
    Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

Proposing a new algorithm to create, store and use constraints for removing 
redundant filters and inferring new filters.
The current algorithm has subpar performance in scenarios involving complex 
expressions with aliases (in certain use cases compilation time can run into 
hours), has the potential to cause OOM, may miss removing redundant filters, 
may miss creating IsNotNull constraints, and does not push compound predicates 
into Join.

# This issue, if not fixed, can cause OutOfMemory errors or unacceptable query 
compilation times.
A test "plan equivalence with case statements and performance 
comparison with benefit of more than 10x conservatively" has been added in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. With 
this PR the compilation time is 247 ms vs 13958 ms without the change.
# It is more effective in filter pruning, as is evident in some of the tests in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, where 
the current code is unable to identify the redundant filter in some cases.
# It generates a better optimized plan for join queries, as it can push 
compound predicates.
# The current logic can miss many possible cases of removing redundant 
predicates, because it fails to take into account whether the same attribute or 
its aliases are repeated multiple times in a complex expression.
# There are cases where optimizer rules that remove redundant predicates based 
on constraint data fail to do so. In some cases the rule works only because 
previous rules happen to cover the inaccuracy. That the ConstraintPropagation 
rule, and its job of removing redundant filters and adding newly inferred 
filters, depends on the behaviour of unrelated earlier optimizer rules is 
itself indicative of issues.
# It does away with all the EqualNullSafe constraints, as this logic does not 
need those constraints to be created.
# There is at least one test in the existing ConstraintPropagationSuite which is 
missing an IsNotNull constraint because the existing Constraints code 
incorrectly generates an EqualNullSafe constraint instead of an EqualTo 
constraint. With these changes, the test correctly creates an EqualTo 
constraint, resulting in an inferred IsNotNull constraint.
# It does away with the current combinatorial logic of evaluating all the 
constraints, which can cause compilation to run into hours or cause OOM. The 
number of constraints stored is exactly the same as the number of filters 
encountered.

h2. Q2. What problem is this proposal NOT designed to solve?
It mainly focuses on compile-time performance, but in some cases it can also 
benefit runtime characteristics, such as inferring IsNotNull filters or pushing 
down compound predicates on joins, which the present code may miss or does not 
do, respectively.

h2. Q3. How is it done today, and what are the limits of current practice?
The current ConstraintsPropagation code pessimistically tries to generate all 
possible combinations of constraints based on the aliases (and even then it may 
miss many combinations if the expression is complex, with the same attribute 
repeated multiple times within the expression and many aliases for that 
column). There are query plans in our production environment which can result 
in the intermediate number of constraints going into the hundreds of thousands, 
causing OOM or compilation times running into hours; a toy simulation of this 
blow-up is sketched below.
There are also cases where it incorrectly generates an EqualNullSafe constraint 
instead of an EqualTo constraint, thus missing a possible IsNotNull constraint 
on a column.
Also, it only pushes single-column predicates to the other side of the join.
The constraints generated are, in some cases, missing the required ones, and 
the plan apparently behaves correctly only because of a preceding, unrelated 
optimizer rule. Tests show that with the bare minimum set of rules containing 
RemoveRedundantPredicate, the removal of the redundant predicate is missed.
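
As a rough illustration of the blow-up, here is a minimal, self-contained Scala 
sketch (purely hypothetical: `ConstraintBlowupDemo` and the string-based 
constraint model are illustrative, not Spark internals) that mimics the 
per-alias expansion pattern of the current code on a single constraint over 
three attributes with 3, 2 and 3 extra aliases respectively:

{code:scala}
// Hypothetical toy model (not Spark internals): a constraint is just the
// sequence of attribute names it references; aliasing replaces one name.
object ConstraintBlowupDemo extends App {
  val base = Seq("a", "b", "c")              // one filter constraint f(a, b, c)
  val aliases = Seq(                         // extra aliases per attribute
    "a" -> Seq("a1", "a2", "a3"),
    "b" -> Seq("b1", "b2"),
    "c" -> Seq("c1", "c2", "c3"))

  // Mimic the current pattern: for every alias, add a rewritten copy of every
  // constraint accumulated so far.
  var all = Set(base)
  for ((attr, as) <- aliases; alias <- as)
    all ++= all.map(_.map(t => if (t == attr) alias else t))

  // 48 for this toy plan; the production plans mentioned above reach
  // hundreds of thousands of intermediate constraints.
  println(all.size)
}
{code}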

h2. Q4. What is new in your approach and why do you think it will be successful?
It solves all the above-mentioned issues.
# The number of constraints created is the same as the number of filters. There 
is no combinatorial creation of constraints and no need for EqualNullSafe 
constraints on aliases (a simplified sketch of the bookkeeping follows this 
list).
# It can remove redundant predicates on any expression involving aliases, 
irrespective of the number of repeated occurrences, in all possible 
combinations.
# It brings query compilation time down from hours to a few minutes.
# It can push compound predicates on joins and infer the right number of 
IsNotNull constraints, which can also positively impact query runtime.
# The proposed algorithm has been running successfully in our environment 
(Workday) for months and has solved all the above issues.
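
As a rough sketch of the bookkeeping (hypothetical names such as 
`TrackedConstraint`; a simplified model, not the actual classes in the PR): 
each filter contributes exactly one stored constraint, and project nodes only 
extend the alias-tracking lists.

{code:scala}
// Simplified, hypothetical model: one entry per Filter, however many aliases exist.
object ConstraintTrackingSketch {
  case class TrackedConstraint(
      baseExpr: String,                      // constraint as seen at the originating Filter
      aliasLists: Map[String, Seq[String]])  // attribute -> its known aliases, base attribute first

  // A Project that introduces new aliases extends the tracking lists of the
  // existing constraints; it never creates additional constraints.
  def addAliases(c: TrackedConstraint, added: Map[String, Seq[String]]): TrackedConstraint =
    c.copy(aliasLists = c.aliasLists.map { case (attr, as) =>
      attr -> (as ++ added.getOrElse(attr, Nil))
    })
}
{code}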

h2. Q5. Who cares? If you are successful, what difference will it make?
For my company, Workday, it has solved the previously failing plans caused by 
OOM and compilation times running into 10 hours or so. I believe there have 
been previous attempts to fix this issue, but they did not make progress due to 
the complexity of the change.
The corresponding PR is 
[PR for this SPIP|https://github.com/apache/spark/pull/33983]


h2. Q6. What are the risks?
The changes are somewhat extensive, but thoroughly tested (old tests plus many 
new tests added). A lot of tests have been added for the Union node, as the 
current constraints tests were found to be insufficient for the Union case.
In that sense, given that all existing tests as well as the new tests pass, 
this is a safe PR.

h2. Q7. How long will it take?
The PR is already there and the implementation is done; whatever time is 
needed is for review and discussion.

h2. Q8. What are the mid-term and final “exams” to check for success?
All tests should pass.
The perf benefit should justify the changes.

  was:
We encountered this issue at Workday. 

The issue is that the current constraints propagation code pessimistically 
generates all possible permutations of a base constraint for the aliases in the 
project node.

This causes a blow-up in the number of constraints generated, leading to OOM at 
query compile time, or queries taking 18 minutes to 2 hours to compile.

The problematic piece of code is in LogicalPlan.getAliasedConstraints:

{code:scala}
projectList.foreach {
  case a @ Alias(l: Literal, _) =>
    allConstraints += EqualNullSafe(a.toAttribute, l)
  case a @ Alias(e, _) =>
    // For every alias in `projectList`, replace the reference in
    // constraints by its attribute.
    allConstraints ++= allConstraints.map(_ transform {
      case expr: Expression if expr.semanticEquals(e) =>
        a.toAttribute
    })
    allConstraints += EqualNullSafe(e, a.toAttribute)
  case _ => // Don't change.
}
{code}

So consider a hypothetical plan:

Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1, c as c2, c as c3)
   |
Filter f(a, b, c)
   |
Base Relation (a, b, c)

The projection therefore yields the alias groups:

a, a1, a2, a3
b, b1, b2
c, c1, c2, c3

Let's say hypothetically that in f(a, b, c), a occurs once, b occurs twice, and 
c occurs three times.

So at the project node the number of constraints generated for the single base 
constraint f(a, b, c) will be

4C1 * 3C2 * 4C3 = 48

In our case we have seen the number of constraints go above 30000, as there are 
complex case statements in the projection.

Spark generates all these constraints pessimistically, for pruning filters or 
pushing down predicates to joins that the optimizer may encounter as it 
traverses up the tree.

 

This issue is solved at our end by modifying the Spark code to use different 
logic.

The idea is simple: instead of pessimistically generating all possible 
combinations of the base constraint, just store the original base constraints 
and track the aliases at each level.

The principle followed is this:

1) Store the base constraint and keep track of the aliases for the underlying 
attributes.

2) If the base attribute composing the constraint is not in the output set, see 
whether the constraint survives by substituting the attribute being removed 
with the next available alias's attribute.

To check whether a filter can be pruned, just canonicalize the filter using the 
attribute at position 0 of the tracking list and compare it with the underlying 
base constraint.

To elaborate using the plan above, at the project node we have the constraint 
f(a, b, c) and we keep track of the aliases:

List 1: a, a1.attribute, a2.attribute, a3.attribute
List 2: b, b1.attribute, b2.attribute
List 3: c, c1.attribute, c2.attribute, c3.attribute

Let's say that above the project node we encounter a filter f(a1, b2, c3).

Canonicalize the filter using the above list data to convert it to f(a, b, c), 
and compare it with the stored base constraint.

 

For predicate pushdown, instead of generating all the redundant combinations 
of constraints, just generate one constraint per element of the alias lists.

In the current Spark code, in any case, filter pushdown happens only for one 
variable at a time.

So just expanding the filter f(a, b, c) to

f(a, b, c), f(a1, b, c), f(a2, b, c), f(a3, b, c), f(a, b1, c), f(a, b2, c), 
f(a, b, c1), f(a, b, c2), f(a, b, c3)

would suffice, rather than generating all the redundant combinations; a sketch 
of this per-alias expansion follows.

In fact the code can easily be modified to generate only those constraints 
which involve the variables forming the join condition, so the number of 
constraints generated on expansion is reduced further.

We already have code to generate compound filters for pushdown (joins on 
multiple conditions), which can be used for single-variable-condition pushdown 
too.
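
A sketch of that per-alias expansion, under the same string-based toy model 
(hypothetical names, not the PR's code):

{code:scala}
// Illustrative only: expand one alias at a time instead of the full cross product.
object PushDownExpansionSketch extends App {
  val base = Seq("a", "b", "c")              // the single stored base constraint f(a, b, c)
  val trackingLists = Map(
    "a" -> Seq("a1", "a2", "a3"),
    "b" -> Seq("b1", "b2"),
    "c" -> Seq("c1", "c2", "c3"))

  // One variant per alias, each varying a single attribute, plus the base itself.
  val expanded: Seq[Seq[String]] =
    base +: base.flatMap { attr =>
      trackingLists.getOrElse(attr, Nil).map(alias => base.map(t => if (t == attr) alias else t))
    }

  assert(expanded.size == 9)  // 1 + 3 + 2 + 3 = 9, versus 48 for the combinatorial expansion
}
{code}

Restricting this expansion to the attributes that actually participate in the 
join condition, as mentioned above, would shrink the list further.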

Just to elaborate the logic further, consider the above hypothetical plan 
(assume the collapse-project rule is not applied):

Project (a1, a1 as a4, b, c1, c1 as c4)
   |
Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1, c as c2, c as c3)
   |
Filter f(a, b, c)
   |
Base Relation (a, b, c)

 

So at project node 2 the constraints data becomes:

List 1: a1, a4.attribute
List 2: b
List 3: c1, c4.attribute

And the constraint f(a, b, c) is modified to f(a1, b, c1).
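
A sketch, again under the toy model (hypothetical names, not the PR code), of 
how the tracking lists and the stored constraint are updated when a project 
node drops some of the attributes:

{code:scala}
// Illustrative only. At project node 2 only a1, a4, b, c1 and c4 survive, so each
// tracking list is filtered to the surviving names and the base constraint is
// rewritten to use the first surviving entry of each list (the "next available alias").
object ConstraintRemapSketch extends App {
  val output = Set("a1", "a4", "b", "c1", "c4")
  val oldLists = Seq(
    Seq("a", "a1", "a2", "a3", "a4"),  // a4 aliases a1, hence ultimately a
    Seq("b", "b1", "b2"),
    Seq("c", "c1", "c2", "c3", "c4"))

  val newLists = oldLists.map(_.filter(output.contains))  // List(a1, a4), List(b), List(c1, c4)

  val remapped = oldLists.foldLeft("f(a, b, c)") { (expr, list) =>
    val survivors = list.filter(output.contains)
    if (!output.contains(list.head) && survivors.nonEmpty) expr.replace(list.head, survivors.head)
    else expr
  }
  assert(remapped == "f(a1, b, c1)")
}
{code}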

 

The above logic is also more effective in filter pruning than the current code, 
as some of our tests show, while adding minimal constraint-propagation overhead.


> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33152
>                 URL: https://issues.apache.org/jira/browse/SPARK-33152
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.1, 3.1.2
>            Reporter: Asif
>            Priority: Major
>              Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>


