peter-toth edited a comment on pull request #32298: URL: https://github.com/apache/spark/pull/32298#issuecomment-1076292484
> * (2) however, `CTERef` is also a primitive for de-duplicating common plan subtrees. The plan trees to be shared do not have to be identical, e.g., one can merge filter predicates with `OR` and union the needed columns into a single, shared CTE definition. Other query engines do that, even though Spark doesn't for now. E.g., this paper describes such optimizations: http://www.vldb.org/pvldb/vol8/p1704-elhelw.pdf
> * (3) I think what this PR does is a special case of (2). E.g., if two plan subtrees (within the same query plan, but not subqueries) run different aggregations over the same table with the same grouping exprs, we can use `CTERef` but not `CommonSubqueries` to share the scan and computation.

I think there are 2 different things here:
- The merging logic (`tryMergePlans()`) in this PR is general enough to handle common plan subtrees. It can be improved of course, and I actually have a follow-up PR to support merging different filter predicates with `OR`; I just didn't want to make this PR more complex. I think the logic can be extracted from `MergeScalarSubqueries` to a common place and reused in follow-up PRs for different purposes. E.g., it could serve the general case (2), and in that case `CTERef` is indeed the only viable way to reference common parts.
- Merging subqueries seems to be a much simpler place to start than merging arbitrary common parts of a plan: it is well defined which plans should be tried to merge with each other. This PR wants to deal with that scope only. For this limited scope `CTERef`'s physical implementation seems to be a bit of an overkill, while a simple expression is the right fit.

> * (4) subqueries might present more optimization opportunities, but I think the additional optimizations would better come up in physical plans rather than logical plans.

> > > I'm not sure I get this. Why should `ColumnPruning` consider these new nodes?
> > There's a pattern matching for CTE:
> > https://github.com/apache/spark/blob/efe43306fcab18f076f755c81c0406ebc1a5fee9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L874-L880
>
> Similarly, if a scalar subquery reference was pruned by some other optimization, we may want to remove the subquery too.

I think this code only prunes the plan child of `WithCTE`, not the CTE definitions, but indeed, we need a `case p @ Project(_, w: <definition hosting node>) =>` pattern there. I'm not sure that removal is needed, as in that case `PlanSubqueries` / `PlanAdaptiveSubqueries` simply don't add the physical `ScalarSubquery` to the plan. Likewise, we don't seem to remove any CTE definitions that are no longer needed.

> > How about combining them into one node (`CommonDefinitions`?) that can host CTEs and scalar subqueries as well? The difference seems minor at the logical level, but the latter avoids things like `SubqueryReference`:
> >
> > ```
> > CommonDef
> > +- Seq(Subquery)
> > ```
> >
> > v.s.
> >
> > ```
> > CommonDef
> > +- Seq(Plan)
> > ```
> >
> > but wrap the CTE into a scalar subquery of (`SELECT .. FROM cte`) at the place of the original subqueries.

I don't feel that `SubqueryReference` is bad and that we need to avoid it, but if others also suggest removing it, I can change the implementation to use only `WithCTE` + `CTERef`s. Please also consider my alternative proposal: rename `WithCTE` to `CommonDef` and keep `CTERef` for common query parts and `SubqueryReference` for common subquery expressions:
```
CommonDef
+- Seq(Subquery)
+- Seq(Plan)
```
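To illustrate the `OR`-merging extension to `tryMergePlans()` mentioned above, here is a toy, self-contained sketch (simple case classes, not Spark's actual Catalyst API): two plans that differ only in their filter predicate merge by OR-ing the predicates, and scans of the same table merge by unioning the needed columns.

```scala
// Hypothetical model of the merge step, NOT Spark's real tryMergePlans().
object MergeSketch {
  sealed trait Plan
  case class Scan(table: String, columns: Seq[String]) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan

  // Scans of the same table merge by unioning their column lists; Filters over
  // mergeable children merge by combining their predicates with OR.
  def tryMergePlans(a: Plan, b: Plan): Option[Plan] = (a, b) match {
    case (Scan(t1, c1), Scan(t2, c2)) if t1 == t2 =>
      Some(Scan(t1, (c1 ++ c2).distinct))
    case (Filter(p1, c1), Filter(p2, c2)) =>
      tryMergePlans(c1, c2).map { merged =>
        if (p1 == p2) Filter(p1, merged) else Filter(s"($p1) OR ($p2)", merged)
      }
    case _ => None
  }
}
```

The shared definition produced this way then needs a compensating filter at each reference site to restore the original per-consumer predicate, which is the part that makes the general case (2) harder than the identical-plan case.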
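The missing cleanup of no-longer-referenced CTE definitions noted above could look roughly like the following toy sketch (the node names mimic, but are not, Spark's Catalyst classes): walk the main plan to collect referenced definition IDs, then drop unused definitions from the hosting node.

```scala
// Hypothetical pruning rule over toy plan nodes, NOT Spark's real optimizer.
object PruneUnusedDefs {
  sealed trait Plan { def children: Seq[Plan] = Nil }
  case class Relation(name: String) extends Plan
  case class CTERef(id: Long) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    override def children: Seq[Plan] = Seq(child)
  }
  case class WithCTE(child: Plan, defs: Map[Long, Plan]) extends Plan {
    override def children: Seq[Plan] = Seq(child)
  }

  // Collect the IDs of all CTE references reachable in the plan.
  private def collectRefs(p: Plan): Set[Long] = p match {
    case CTERef(id) => Set(id)
    case other      => other.children.flatMap(collectRefs).toSet
  }

  // Keep only the definitions that are still referenced; drop the hosting
  // node entirely once no definition is used anymore.
  def apply(plan: Plan): Plan = plan match {
    case w @ WithCTE(child, defs) =>
      val used = collectRefs(child)
      val kept = defs.filter { case (id, _) => used(id) }
      if (kept.isEmpty) child else w.copy(defs = kept)
    case other => other
  }
}
```

A real rule would also have to follow references between the definitions themselves; the sketch ignores that to keep the shape of the idea visible.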
