[
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-33152:
----------------------------------
Target Version/s: (was: 4.1.0)
> SPIP: Constraint Propagation code causes OOM issues or increasing compilation
> time to hours
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0, 4.1.0, 4.0.0
> Reporter: Asif
> Priority: Major
> Labels: SPIP, pull-request-available
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using
> absolutely no jargon.
> Proposing new algorithm to create, store and use constraints for removing
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios
> involving aliases( with certain use cases the compilation time can go into
> hours), potential to cause OOM, may miss removing redundant filters in
> different scenarios, may miss creating IsNotNull constraints in different
> scenarios, does not push compound predicates in Join.
> # This issue if not fixed can cause OutOfMemory issue or unacceptable query
> compilation times.
> Have added a test "plan equivalence with case statements and performance
> comparison with benefit of more than 10x conservatively" in
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite.
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
> # It is more effective in filter pruning as is evident in some of the tests
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite
> where current code is not able to identify the redundant filter in some cases.
> # It is able to generate a better optimized plan for join queries as it can
> push compound predicates.
> # The current logic can miss a lot of possible cases of removing redundant
> predicates, as it fails to take into account if same attribute or its aliases
> are repeated multiple times in a complex expression.
> # There are cases where some of the optimizer rules involving removal of
> redundant predicates fail to remove on the basis of constraint data. In some
> cases the rule works, just by the virtue of previous rules helping it out to
> cover the inaccuracy. That the ConstraintPropagation rule & its function of
> removal of redundant filters & addition of new inferred filters is dependent
> on the working of some of the other unrelated previous optimizer rules is
> behaving, is indicative of issues.
> # It does away with all the EqualNullSafe constraints as this logic does not
> need those constraints to be created.
> # There is at least one test in existing ConstraintPropagationSuite which is
> missing a IsNotNull constraints because the code incorrectly generated a
> EqualsNullSafeConstraint instead of EqualTo constraint, when using the
> existing Constraints code. With these changes, the test correctly creates an
> EqualTo constraint, resulting in an inferred IsNotNull constraint
> # It does away with the current combinatorial logic of evaluation all the
> constraints can cause compilation to run into hours or cause OOM. The number
> of constraints stored is exactly the same as the number of filters encountered
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile time performance, but in some cases can benefit
> run time characteristics too, like inferring IsNotNull filter or pushing down
> compound predicates on the join, which currently may get missed/ does not
> happen , respectively, by the present code.
> h2. Q3. How is it done today, and what are the limits of current practice?
> Current ConstraintsPropagation code, pessimistically tries to generates all
> the possible combinations of constraints , based on the aliases ( even then
> it may miss a lot of combinations if the expression is a complex expression
> involving same attribute repeated multiple times within the expression and
> there are many aliases to that column). There are query plans in our
> production env, which can result in intermediate number of constraints going
> into hundreds of thousands, causing OOM or taking time running into hours.
> Also there are cases where it incorrectly generates an EqualNullSafe
> constraint instead of EqualTo constraint , thus missing a possible IsNull
> constraint on column.
> Also it only pushes single column predicate on the other side of the join.
> The constraints generated , in some cases, are missing the required ones, and
> the plan apparently is behaving correctly only due to the preceding unrelated
> optimizer rule. Have Test which show that with the bare mnimum rules
> containing RemoveRedundantPredicate, it misses the removal of redundant
> predicate.
> h2. Q4. What is new in your approach and why do you think it will be
> successful?
> It solves all the above mentioned issues.
> # The number of constraints created are same as the number of filters. No
> combinatorial creation of constraints. No need for EqualsNullSafe constraint
> on aliases.
> # Can remove redundant predicates on any expression involving aliases
> irrespective of the number of repeat occurences in all possible combination.
> # Brings down query compilation time to few minutes from hours.
> # Can push compound predicates on Joins & infer right number of IsNotNull
> constraints which can impact query runtime also positively.
> # The proposed algorithm has been running successfully in our env. (WorkDay)
> for months & has solved all the above issues.
> h2. Q5. Who cares? If you are successful, what difference will it make?
> For My company WorkDay, it has solved the previously failing plans due to OOM
> & compilation time running into 10 hrs or so. I suppose there have been
> previous attempts too, to fix this issue, but did not make progress due to
> complexity of change.
> The PR for the same is
> [https://github.com/apache/spark/pull/49117|https://github.com/apache/spark/pull/49117]
> h2. Q6. What are the risks?
> Well the changes are little extensive, but thoroughly tested ( old & many new
> tests added). Have added a lot of tests for Union node, as found that current
> constraints tests were not sufficient for Union case.
> So in that sense , given that all existing tests as well as new tests are
> clean, this is a safe PR.
> h2. Q7. How long will it take?
> The PR is already there. Implementation already done. whatever time needed is
> for review and discussion.
> [https://github.com/apache/spark/pull/49117|https://github.com/apache/spark/pull/49117]
> h2. Q8. What are the mid-term and final “exams” to check for success?
> All tests should pass.
> The perf benefit should justify the changes.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]