Re: [PR] [SPARK-44571][SQL] Merge subplans with one row result [spark]

via GitHub Thu, 13 Nov 2025 04:48:54 -0800


peter-toth commented on code in PR #53019:
URL: https://github.com/apache/spark/pull/53019#discussion_r2523316353



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala:
##########
@@ -1,241 +0,0 @@
-/*

Review Comment:
   I reverted the scala docs change of `MergeSubplans` in 
https://github.com/apache/spark/pull/53019/commits/335b59352c67fcfa94500218f1b36107c26ef0f6
 for now to make the 2 files much similar. GitHub recognizes the file rename 
now and shows the real diff. This way the review should be much easier.
   
   I can update the scala docs to the following in a follow-up PR:
   ```
   /**
    * This rule tries to merge multiple subplans that have one row result. This 
can be either the plan
    * tree of a [[ScalarSubquery]] expression or the plan tree starting at a 
non-grouping [[Aggregate]]
    * node.
    *
    * The process is the following:
    * - While traversing through the plan each one row returning subplan is 
tried to merge into already
    *   seen one row returning subplans using `PlanMerger`s.
    *   During this first traversal each [[ScalarSubquery]] expression is 
replaced to a temporal
    *   [[ScalarSubqueryReference]] and each non-grouping [[Aggregate]] node is 
replaced to a temporal
    *   [[NonGroupingAggregateReference]] pointing to its possible merged 
version in `PlanMerger`s.
    *   `PlanMerger`s keep track of whether a plan is a result of merging 2 or 
more subplans, or is an
    *   original unmerged plan.
    *   [[ScalarSubqueryReference]]s and [[NonGroupingAggregateReference]]s 
contain all the required
    *   information to either restore the original subplan or create a 
reference to a merged CTE.
    * - Once the first traversal is complete and all possible merging have been 
done, a second
    *   traversal removes the references to either restore the original 
subplans or to replace the
    *   original to a modified ones that reference a CTE with a merged plan.
    *   A modified [[ScalarSubquery]] is constructed like:
    *   `GetStructField(ScalarSubquery(CTERelationRef to the merged plan), 
merged output index)`
    *   ans a modified [[Aggregate]] is constructed like:
    *   ```
    *   Project(
    *     Seq(
    *       GetStructField(
    *         ScalarSubquery(CTERelationRef to the merged plan),
    *         merged output index 1),
    *       GetStructField(
    *         ScalarSubquery(CTERelationRef to the merged plan),
    *         merged output index 2),
    *       ...),
    *     OneRowRelation)
    *   ```
    *   where `merged output index`s are the index of the output attributes (of 
the CTE) that
    *   correspond to the output of the original node.
    * - If there are merged subqueries in `PlanMerger`s then a `WithCTE` node 
is built from these
    *   queries. The `CTERelationDef` nodes contain the merged subplans in the 
following form:
    *   `Project(Seq(CreateNamedStruct(name1, attribute1, ...) AS mergedValue), 
mergedSubplan)`.
    *   The definitions are flagged that they host a subplan, that can return 
maximum one row.
    *
    * Here are a few examples:
    *
    * 1. a query with 2 subqueries:
    * ```
    * Project [scalar-subquery [] AS scalarsubquery(), scalar-subquery [] AS 
scalarsubquery()]
    * :  :- Aggregate [min(a) AS min(a)]
    * :  :  +- Relation [a, b, c]
    * :  +- Aggregate [sum(b) AS sum(b)]
    * :     +- Relation [a, b, c]
    * +- OneRowRelation
    * ```
    * is optimized to:
    * ```
    * WithCTE
    * :- CTERelationDef 0
    * :  +- Project [named_struct(min(a), min(a), sum(b), sum(b)) AS 
mergedValue]
    * :     +- Aggregate [min(a) AS min(a), sum(b) AS sum(b)]
    * :        +- Relation [a, b, c]
    * +- Project [scalar-subquery [].min(a) AS scalarsubquery(),
    *             scalar-subquery [].sum(b) AS scalarsubquery()]
    *    :  :- CTERelationRef 0
    *    :  +- CTERelationRef 0
    *    +- OneRowRelation
    * ```
    *
    * 2. a query with 2 non-grouping aggregates:
    * ```
    * Join Inner
    * :- Aggregate [min(a) AS min(a)]
    * :  +- Relation [a, b, c]
    * +- Aggregate [sum(b) AS sum(b), avg(cast(c as double)) AS avg(c)]
    *    +- Relation [a, b, c]
    * ```
    * is optimized to:
    * ```
    * WithCTE
    * :- CTERelationDef 0
    * :  +- Project [named_struct(min(a), min(a), sum(b), sum(b), avg(c), 
avg(c)) AS mergedValue]
    * :     +- Aggregate [min(a) AS min(a), sum(b) AS sum(b), avg(cast(c as 
double)) AS avg(c)]
    * :        +- Relation [a, b, c]
    * +- Join Inner
    *    :- Project [scalar-subquery [].min(a) AS min(a)]
    *    :  :  +- CTERelationRef 0
    *    :  +- OneRowRelation
    *    +- Project [scalar-subquery [].sum(b) AS sum(b), scalar-subquery 
[].avg(c) AS avg(c)]
    *       :  :- CTERelationRef 0
    *       :  +- CTERelationRef 0
    *       +- OneRowRelation
    * ```
    *
    * 3. a query with a subquery and a non-grouping aggregate:
    * ```
    * Join Inner
    * :- Project [scalar-subquery [] AS scalarsubquery()]
    * :  :  +- Aggregate [min(a) AS min(a)]
    * :  :     +- Relation [a, b, c]
    * :  +- OneRowRelation
    * +- Aggregate [sum(b) AS sum(b), avg(cast(c as double)) AS avg(c)]
    *    +- Relation [a, b, c]
    * ```
    * is optimized to:
    * ```
    * WithCTE
    * :- CTERelationDef 0
    * :  +- Project [named_struct(min(a), min(a), sum(b), sum(b), avg(c), 
avg(c)) AS mergedValue]
    * :     +- Aggregate [min(a) AS min(a), sum(b) AS sum(b), avg(cast(c as 
double)) AS avg(c)]
    * :        +- Relation [a, b, c]
    * +- Join Inner
    *    :- Project [scalar-subquery [].min(a) AS scalarsubquery()]
    *    :  :  +- CTERelationRef 0
    *    :  +- OneRowRelation
    *    +- Project [scalar-subquery [].sum(b) AS sum(b), scalar-subquery 
[].avg(c) AS avg(c)]
    *       :  :- CTERelationRef 0
    *       :  +- CTERelationRef 0
    *       +- OneRowRelation
    * ```
    */
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-44571][SQL] Merge subplans with one row result [spark]

Reply via email to