[
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781966#comment-17781966
]
Abhinav Kumar commented on SPARK-36786:
---------------------------------------
[~ashahid7] [[email protected]] where are we on this one?
> SPIP: Improving the compile time performance, by improving a couple of
> rules, from 24 hrs to under 8 minutes
> -------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-36786
> URL: https://issues.apache.org/jira/browse/SPARK-36786
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.1, 3.1.2
> Reporter: Asif
> Priority: Major
> Labels: SPIP
>
> h2. Q1. What are you trying to do? Articulate your objectives using
> absolutely no jargon.
> The aim is to improve the compile-time performance of a query which, in
> WorkDay's use case, takes > 24 hrs (and eventually fails), to < 8 min.
> To explain the problem, I will first provide the context.
> The query plan in our production system is huge, with nested *case when*
> expressions (the level of nesting can be > 8), where each *case when* can
> sometimes have > 1000 branches.
> The plan could look like:
> {quote}Project1
> |
> Filter 1
> |
> Project2
> |
> Filter2
> |
> Project3
> |
> Filter3
> |
> Join
> {quote}
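> To make this shape concrete, a minimal, purely illustrative DataFrame sketch of
> such a plan is shown below (the table and column names fact, dim, c1..c3, id are
> made up, and an active SparkSession named spark is assumed, e.g. from spark-shell):
> {code:scala}
> import org.apache.spark.sql.functions._
>
> // Hypothetical input: a join of two made-up tables with columns c1, c2, c3, id.
> val base = spark.table("fact").join(spark.table("dim"), "id")
>
> // A nested CASE WHEN; in the real plans each level can have > 1000 branches
> // and the nesting can go more than 8 levels deep.
> val nestedCase =
>   when(col("c1") === 1,
>     when(col("c2") === 1, lit("a")).otherwise(lit("b")))
>   .otherwise(
>     when(col("c3") === 1, lit("c")).otherwise(lit("d")))
>
> // Alternating Filter / Project layers stacked on top of the Join.
> val plan = base
>   .filter(col("c3") === 1)                 // Filter3
>   .select(nestedCase.as("x"), col("id"))   // Project3
>   .filter(col("id") > 0)                   // Filter2
>   .select(col("x"), col("id"))             // Project2
>   .filter(col("x") =!= "d")                // Filter1
>   .select(col("x"))                        // Project1
> {code}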
> Now the optimizer has a Batch of Rules, intended to run at most 100 times.
> *Also note that the batch will continue to run until one of the conditions
> is satisfied,*
> *i.e. either numIter == 100 or inputPlan == outputPlan (idempotency is
> achieved).*
> One of the early rules is *PushDownPredicate*, followed by *CollapseProject*.
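> A minimal sketch of how such a fixed-point batch behaves (illustrative only,
> not Spark's actual RuleExecutor code):
> {code:scala}
> // The rules are re-applied until the plan stops changing or the iteration
> // limit (100 here) is reached.
> def runBatch[Plan](plan: Plan,
>                    rules: Seq[Plan => Plan],
>                    maxIterations: Int = 100): Plan = {
>   var current   = plan
>   var iteration = 0
>   var changed   = true
>   while (changed && iteration < maxIterations) {
>     val next = rules.foldLeft(current)((p, rule) => rule(p))
>     changed = next != current          // idempotency reached => stop early
>     current = next
>     iteration += 1
>   }
>   current
> }
> {code}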
>
> The first issue is the *PushDownPredicate* rule.
> It picks one filter at a time and pushes it to the lowest level (I understand
> that in 3.1 it pushes through a Join, while in 2.4 it stops at the Join), but
> in either case it picks one filter at a time, starting from the top, in each
> iteration.
> *The above comment is no longer true in the 3.1 release, as it now combines
> filters, so it does push all the encountered filters in a single pass.
> But it still materializes the filter on each push by re-aliasing.*
> So if there are, say, 50 Projects interspersed with Filters, idempotency is
> guaranteed not to be achieved until around 49 iterations.
> Moreover, CollapseProject will also be modifying the tree on each iteration,
> as a filter will get removed from within a Project.
> Furthermore, on each movement of the filter through the Project tree, the
> filter is re-aliased using transformUp. transformUp is very expensive compared
> to transformDown, and as the filter keeps getting pushed down, its size increases.
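> A rough illustration of this per-push re-aliasing cost, using simplified
> stand-in types rather than Catalyst's real Expression classes:
> {code:scala}
> sealed trait Expr
> case class Attr(name: String) extends Expr
> case class Add(left: Expr, right: Expr) extends Expr
> case class GreaterThan(left: Expr, right: Expr) extends Expr
> case class Literal(value: Int) extends Expr
>
> // Re-aliasing walks the entire filter condition once per Project it is pushed
> // through, replacing attribute references with the Project's alias definitions.
> def reAlias(cond: Expr, aliases: Map[String, Expr]): Expr = cond match {
>   case Attr(n)           => aliases.getOrElse(n, Attr(n))
>   case Add(l, r)         => Add(reAlias(l, aliases), reAlias(r, aliases))
>   case GreaterThan(l, r) => GreaterThan(reAlias(l, aliases), reAlias(r, aliases))
>   case lit: Literal      => lit
> }
>
> // Example: Filter(x > 0) pushed below Project(a + b AS x) becomes
> // Filter(a + b > 0): the condition grows, and the next push walks this
> // larger tree all over again.
> val pushedOnce = reAlias(
>   GreaterThan(Attr("x"), Literal(0)),
>   Map("x" -> Add(Attr("a"), Attr("b"))))
> {code}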
> To optimize this rule, 2 things are needed:
> # Instead of pushing one filter at a time, collect all the filters as we
> traverse the tree in that single iteration.
> # Do not re-alias the filters on each push. Instead, collect the sequence of
> Projects a filter has passed through, and once the filters have reached their
> resting place, do the re-aliasing by processing the collected Projects in a
> bottom-up manner (a sketch follows below).
> This will result in achieving idempotency in a couple of iterations.
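> A hedged sketch of point 2 above (a simplified illustration, not the actual PR;
> filter conditions are reduced to plain column renames to keep the example tiny):
> {code:scala}
> // aliasMapsTopDown(i) is the alias map of the i-th Project the filter crossed,
> // starting from the Project nearest the filter's original position.
> // The collected Projects are processed bottom-up, composing their alias maps
> // into a single map, so the filter condition itself is rewritten only once.
> def composeBottomUp(aliasMapsTopDown: Seq[Map[String, String]]): Map[String, String] =
>   aliasMapsTopDown.foldRight(Map.empty[String, String]) { (aliases, composedBelow) =>
>     // Remap this Project's aliases through everything already composed below it;
>     // pass-through columns keep the mapping established by the lower Projects.
>     composedBelow ++ aliases.map { case (name, ref) =>
>       name -> composedBelow.getOrElse(ref, ref)
>     }
>   }
>
> // One rewrite of the filter's referenced columns, instead of one per Project.
> def reAliasOnce(referencedColumns: Set[String], composed: Map[String, String]): Set[String] =
>   referencedColumns.map(c => composed.getOrElse(c, c))
>
> // Example: the filter references "x"; Project1 defines x := a, Project2 defines a := b.
> // reAliasOnce(Set("x"), composeBottomUp(Seq(Map("x" -> "a"), Map("a" -> "b")))) == Set("b")
> {code}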
> *How reducing the number of iterations helps performance*
> There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals
> (... there are around 6 more such rules)* which traverse the tree using
> transformUp, and they run unnecessarily in each iteration, even when the
> expressions in an operator have not changed since the previous runs.
> *I have a different proposal, which I will share later, on how to stop the
> above rules from running unnecessarily, if it can be guaranteed that the
> expressions in the operator are not going to mutate.*
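> Purely as an illustration of the kind of guard that could achieve this (an
> assumption about one possible mechanism, not the yet-to-be-shared design):
> {code:scala}
> // Tag each operator with whether its expressions may still mutate, and skip
> // the expression-simplification rules for operators known to be stable.
> final case class OperatorState[Plan](plan: Plan, expressionsMayMutate: Boolean)
>
> def applyExpressionRule[Plan](op: OperatorState[Plan],
>                               rule: Plan => Plan): OperatorState[Plan] =
>   if (op.expressionsMayMutate) op.copy(plan = rule(op.plan))
>   else op  // NullPropagation, OptimizeIn, SimplifyConditionals, ... skipped here
> {code}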
> The cause of our huge compilation time has been identified as the above.
>
> h2. Q2. What problem is this proposal NOT designed to solve?
> It is not going to change any runtime profile.
> h2. Q3. How is it done today, and what are the limits of current practice?
> As mentioned above, PushDownPredicate currently pushes one filter at a time
> and, at each Project, materializes the re-aliased filter. This results in a
> large number of iterations to achieve idempotency, and the immediate
> materialization of the Filter after each Project pass results in unnecessary
> tree traversals of the filter expression, using transformUp at that. The
> expression tree of the filter is also bound to keep growing as it is pushed
> down.
> h2. Q4. What is new in your approach and why do you think it will be
> successful?
> In the new approach we push all the filters down in a single pass, and we do
> not materialize the filters as they pass through a Project. Instead we keep
> collecting the Projects in sequential order and materialize the final filter
> once its final position is reached (above a Join in the case of 2.4, or above
> the base relation, etc.).
> When this approach is coupled with the logic of identifying those Project
> operators whose expressions will not mutate (which I will share later), so
> that rules like
> NullPropagation,
> OptimizeIn,
> LikeSimplification,
> BooleanSimplification,
> SimplifyConditionals,
> RemoveDispensableExpressions,
> SimplifyBinaryComparison,
> SimplifyCaseConversionExpressions,
> SimplifyExtractValueOps
>
> are applied only in the first pass on the expressions of that Project
> operator, the compilation time of the offending queries is reduced to under
> 8 mins from 24 hrs or more.
>
> h2. Q5. Who cares? If you are successful, what difference will it make?
> For my company, WorkDay, it will fix the plans that currently fail due to
> OOM and compilation times running into 24 hrs or so. I have a PR for this
> locally and will publish it in some time.
>
> h2. Q6. What are the risks?
> The risk in the change to PushDownPredicate is very low.
> The next proposal, identifying Project operators whose expressions will be
> immutable so that the above set of rules runs only once, involves some
> relatively complex logic, but with extra test coverage it should be safe.
> h2. Q7. How long will it take?
> The basic changes are already in place. Tests will take time: around 10-15
> days.
> h2. Q8. What are the mid-term and final “exams” to check for success?
> All tests should pass.
> The perf benefit should justify the changes.