[ 
https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Afroozeh updated SPARK-30072:
---------------------------------
    Description: 
This PR changes subquery planning by calling the planner and plan preparation
rules on the subquery plan directly. Previously, we created a QueryExecution
instance for each subquery to get the executedPlan. This would re-run analysis
and optimization on the subquery's plan. Running the analysis again on an
optimized query plan can have unwanted consequences, as some rules, for example
DecimalPrecision, are not idempotent.

As an example, consider the expression 1.7 * avg(a) which after applying the 
DecimalPrecision rule becomes:

promote_precision(1.7) * promote_precision(avg(a))

After the optimization, more specifically the constant folding rule, this 
expression becomes:

1.7 * promote_precision(avg(a))

Now if we run the analyzer on this optimized query again, we will get:

promote_precision(1.7) * promote_precision(promote_precision(avg(a)))

This will later be optimized to:

1.7 * promote_precision(promote_precision(avg(a)))

As can be seen, re-running the analysis and optimization on this expression
results in an expression with extra nested promote_precision nodes. Adding
unneeded nodes to the plan is problematic because it can make otherwise
equivalent plans differ, eliminating opportunities to reuse the plan.
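
The rewrite sequence above can be reproduced with a toy model. The following
is only a sketch in Python, not Spark code: the classes Lit, Avg, Promote and
Mul and the functions analyze and optimize are hypothetical stand-ins for
Catalyst expressions, the DecimalPrecision rule and constant folding.

```python
from dataclasses import dataclass


# Hypothetical expression nodes; these are NOT Spark's Catalyst classes.
@dataclass(frozen=True)
class Lit:
    value: float


@dataclass(frozen=True)
class Avg:
    column: str


@dataclass(frozen=True)
class Promote:
    child: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


def analyze(e):
    """Toy stand-in for DecimalPrecision: wrap both operands of a multiply."""
    if isinstance(e, Mul):
        return Mul(Promote(e.left), Promote(e.right))
    return e


def optimize(e):
    """Toy stand-in for constant folding: promote_precision(literal) -> literal."""
    if isinstance(e, Mul):
        return Mul(optimize(e.left), optimize(e.right))
    if isinstance(e, Promote):
        if isinstance(e.child, Lit):
            return e.child
        return Promote(optimize(e.child))
    return e


def show(e):
    """Render an expression in the notation used in this description."""
    if isinstance(e, Lit):
        return str(e.value)
    if isinstance(e, Avg):
        return f"avg({e.column})"
    if isinstance(e, Promote):
        return f"promote_precision({show(e.child)})"
    return f"{show(e.left)} * {show(e.right)}"


expr = Mul(Lit(1.7), Avg("a"))
once = optimize(analyze(expr))   # analysis and optimization run once
twice = optimize(analyze(once))  # both run again on the optimized plan

print(show(once))
print(show(twice))
print(once == twice)  # structural equality, used here as a proxy for reuse
```

In this model the two results are structurally different, so any reuse check
based on plan equality would treat them as distinct plans.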

We opted to introduce dedicated planners for subqueries, instead of making the
DecimalPrecision rule idempotent, because this eliminates the entire category
of problems caused by re-running analysis on an optimized plan. Another benefit
is that planning time for subqueries is reduced.
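
The difference between the two approaches can be sketched as two pipelines.
This is an illustrative Python model under stated assumptions, not the actual
implementation: expressions are plain tuples, and analyze, optimize, plan,
plan_via_query_execution and plan_subquery_directly are hypothetical names.

```python
# Expressions are tuples: ("mul", l, r), ("promote", c), ("lit", v), ("avg", col).

def analyze(e):
    """Toy non-idempotent rule: wrap both operands of a multiply."""
    if e[0] == "mul":
        return ("mul", ("promote", e[1]), ("promote", e[2]))
    return e


def optimize(e):
    """Toy constant folding: promote(literal) collapses to the literal."""
    if e[0] == "mul":
        return ("mul", optimize(e[1]), optimize(e[2]))
    if e[0] == "promote":
        child = e[1]
        return child if child[0] == "lit" else ("promote", optimize(child))
    return e


def plan(e):
    """Hypothetical physical planning step; just tags the expression."""
    return ("physical", e)


def plan_via_query_execution(subquery):
    """Old behaviour in this model: the full pipeline runs on the subquery."""
    return plan(optimize(analyze(subquery)))


def plan_subquery_directly(subquery):
    """New behaviour in this model: only planning runs on the optimized plan."""
    return plan(subquery)


# The subquery arrives already analyzed and optimized by the outer query.
optimized = optimize(analyze(("mul", ("lit", 1.7), ("avg", "a"))))

old_result = plan_via_query_execution(optimized)
new_result = plan_subquery_directly(optimized)

print(old_result)
print(new_result)
```

In this model the direct path leaves the optimized expression untouched and
skips two stages, while the full pipeline adds a second promote wrapper.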

> Create dedicated planner for subqueries
> ---------------------------------------
>
>                 Key: SPARK-30072
>                 URL: https://issues.apache.org/jira/browse/SPARK-30072
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Ali Afroozeh
>            Priority: Minor


