sigmod commented on pull request #32298: URL: https://github.com/apache/spark/pull/32298#issuecomment-1076143230
> `CTERef` are invented for 1 special purpose

- (1) Yes, we need `CTERef` for correctness when a CTE definition is non-deterministic.
- (2) However, `CTERef` is also a primitive for de-duplicating common plan subtrees. The plan trees to be shared do not have to be identical, e.g., one can merge filter predicates with `OR` and union the needed columns into a single, shared CTE definition. Other query engines do that, even though Spark doesn't for now. E.g., this paper describes such optimizations: http://www.vldb.org/pvldb/vol8/p1704-elhelw.pdf.
- (3) I think what this PR does is a special case of (2). E.g., if you have two plan subtrees (within the same query plan, but not subqueries) that run different aggregations over the same table with the same grouping exprs, we can use `CTERef`, but not `CommonSubqueries`, to share the scan and computation.
- (4) Subqueries might present more optimization opportunities, but I think the additional optimizations would better come up in physical plans rather than logical plans.

> I'm not sure I get this. Why ColumnPruning should consider these new nodes?

There's a pattern matching for CTE:
https://github.com/apache/spark/blob/efe43306fcab18f076f755c81c0406ebc1a5fee9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L874-L880

Similarly, if a scalar subquery reference was pruned by some other optimization, we may want to remove the subquery too.

> how about combining them into one node (CommonDefinitions?) that can host CTEs and scalar subqueries as well?

The difference seems minor at the logical level, but the latter avoids things like SubqueryReference:

    CommonDef
    +- Seq(Subquery)

vs.

    CommonDef
    +- Seq(Plan)

but wraps the CTE into a scalar subquery of (Select .. FROM cte) at the place of the original subqueries.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
