peter-toth edited a comment on pull request #32298: URL: https://github.com/apache/spark/pull/32298#issuecomment-1076292484
> * (2) however, `CTERef` is also a primitive for de-duplicating common plan subtrees. The plan trees to be shared do not have to be identical, e.g., one can merge filter predicates with `OR` and union the needed columns into a single, shared CTE definition. Other query engines do that, even though Spark doesn't for now. E.g., this paper describes such optimizations: http://www.vldb.org/pvldb/vol8/p1704-elhelw.pdf
> * (3) I think what this PR does is a special case of (2). E.g., if two plan subtrees (within the same query plan, but not subqueries) run different aggregations over the same table with the same grouping exprs, we can use `CTERef` but not `CommonSubqueries` to share the scan and computation.

I think there are 2 different things here:
- The merging logic (`tryMergePlans()`) in this PR is general enough to handle common plan subtrees. It can be improved of course, and I actually have a follow-up PR to support merging different filter predicates with `OR`; I just didn't want to make this PR more complex. I think the logic can be extracted from `MergeScalarSubqueries` to a common place and reused in follow-up PRs for different purposes. E.g., it could serve the general case (2), and in that case `CTERef` is indeed the only viable way to reference common parts.
- Merging subqueries seems to be a much simpler place to start than merging arbitrary common parts of a plan: it is well defined which plans should be tried to merge with each other. This PR wants to deal with that scope only. For this limited scope `CTERef`'s physical implementation seems to be a bit of an overkill, while a simple expression is the right fit.

> * (4) subqueries might present more optimization opportunities, but I think the additional optimizations would better come up in physical plans rather than logical plans.

> > > I'm not sure I get this. Why should `ColumnPruning` consider these new nodes?
> > There's a pattern matching for CTE:
> > https://github.com/apache/spark/blob/efe43306fcab18f076f755c81c0406ebc1a5fee9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L874-L880
>
> Similarly, if a scalar subquery reference was pruned by some other optimization, we may want to remove the subquery too.

I think this code only prunes the plan child of `WithCTE`, not the CTE definitions, but indeed, we need a `case p @ Project(_, w: <definition hosting node>) =>` pattern there. I'm not sure that removal is needed, as in that case `PlanSubqueries` / `PlanAdaptiveSubqueries` simply don't add the physical `ScalarSubquery` to the plan. Likewise, we don't seem to remove any CTE definitions that are no longer needed.

> > How about combining them into one node (`CommonDefinitions`?) that can host CTEs and scalar subqueries as well? The difference seems minor at the logical level, but the latter avoids things like `SubqueryReference`:
> >
> > ```
> > CommonDef
> > +- Seq(Subquery)
> > ```
> >
> > v.s.
> >
> > ```
> > CommonDef
> > +- Seq(Plan)
> > ```
> >
> > but wrap the CTE into a scalar subquery of (`SELECT .. FROM cte`) at the place of the original subqueries.

I don't feel that `SubqueryReference` is bad and that we need to avoid it, but if others also suggest removing it, I can change the implementation to use only `WithCTE` + `CTERef`s. Please also consider my alternative proposal: rename `WithCTE` to `CommonDef` and keep `CTERef` for common query parts and `SubqueryReference` for common subquery expressions:
```
CommonDef
+- Seq(Subquery)
+- Seq(Plan)
```
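To illustrate the `OR`-merging extension to `tryMergePlans()` mentioned above, here is a toy, self-contained sketch (simple case classes, not Spark's actual Catalyst API): two plans that differ only in their filter predicate merge by OR-ing the predicates, and scans of the same table merge by unioning the needed columns.

```scala
// Hypothetical model of the merge step, NOT Spark's real tryMergePlans().
object MergeSketch {
  sealed trait Plan
  case class Scan(table: String, columns: Seq[String]) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan

  // Scans of the same table merge by unioning their column lists; Filters over
  // mergeable children merge by combining their predicates with OR.
  def tryMergePlans(a: Plan, b: Plan): Option[Plan] = (a, b) match {
    case (Scan(t1, c1), Scan(t2, c2)) if t1 == t2 =>
      Some(Scan(t1, (c1 ++ c2).distinct))
    case (Filter(p1, c1), Filter(p2, c2)) =>
      tryMergePlans(c1, c2).map { merged =>
        if (p1 == p2) Filter(p1, merged) else Filter(s"($p1) OR ($p2)", merged)
      }
    case _ => None
  }
}
```

The shared definition produced this way then needs a compensating filter at each reference site to restore the original per-consumer predicate, which is the part that makes the general case (2) harder than the identical-plan case.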
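The missing cleanup of no-longer-referenced CTE definitions noted above could look roughly like the following toy sketch (the node names mimic, but are not, Spark's Catalyst classes): walk the main plan to collect referenced definition IDs, then drop unused definitions from the hosting node.

```scala
// Hypothetical pruning rule over toy plan nodes, NOT Spark's real optimizer.
object PruneUnusedDefs {
  sealed trait Plan { def children: Seq[Plan] = Nil }
  case class Relation(name: String) extends Plan
  case class CTERef(id: Long) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    override def children: Seq[Plan] = Seq(child)
  }
  case class WithCTE(child: Plan, defs: Map[Long, Plan]) extends Plan {
    override def children: Seq[Plan] = Seq(child)
  }

  // Collect the IDs of all CTE references reachable in the plan.
  private def collectRefs(p: Plan): Set[Long] = p match {
    case CTERef(id) => Set(id)
    case other      => other.children.flatMap(collectRefs).toSet
  }

  // Keep only the definitions that are still referenced; drop the hosting
  // node entirely once no definition is used anymore.
  def apply(plan: Plan): Plan = plan match {
    case w @ WithCTE(child, defs) =>
      val used = collectRefs(child)
      val kept = defs.filter { case (id, _) => used(id) }
      if (kept.isEmpty) child else w.copy(defs = kept)
    case other => other
  }
}
```

A real rule would also have to follow references between the definitions themselves; the sketch ignores that to keep the shape of the idea visible.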
