peter-toth commented on code in PR #53019:
URL: https://github.com/apache/spark/pull/53019#discussion_r2523316353
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala:
##########
@@ -1,241 +0,0 @@
-/*
Review Comment:
I reverted the scala docs change of `MergeSubplans` for now to make the 2
files much similar. GitHub recognizes the file rename now and shows the real
diff. This way the review should be much easier.
I can update the scala docs to the following in a follow-up PR:
```
/**
* This rule tries to merge multiple subplans that have one row result. This
can be either the plan
* tree of a [[ScalarSubquery]] expression or the plan tree starting at a
non-grouping [[Aggregate]]
* node.
*
* The process is the following:
* - While traversing through the plan each one row returning subplan is
tried to merge into already
* seen one row returning subplans using `PlanMerger`s.
* During this first traversal each [[ScalarSubquery]] expression is
replaced to a temporal
* [[ScalarSubqueryReference]] and each non-grouping [[Aggregate]] node is
replaced to a temporal
* [[NonGroupingAggregateReference]] pointing to its possible merged
version in `PlanMerger`s.
* `PlanMerger`s keep track of whether a plan is a result of merging 2 or
more subplans, or is an
* original unmerged plan.
* [[ScalarSubqueryReference]]s and [[NonGroupingAggregateReference]]s
contain all the required
* information to either restore the original subplan or create a
reference to a merged CTE.
* - Once the first traversal is complete and all possible merging have been
done, a second
* traversal removes the references to either restore the original
subplans or to replace the
* original to a modified ones that reference a CTE with a merged plan.
* A modified [[ScalarSubquery]] is constructed like:
* `GetStructField(ScalarSubquery(CTERelationRef to the merged plan),
merged output index)`
* ans a modified [[Aggregate]] is constructed like:
* ```
* Project(
* Seq(
* GetStructField(
* ScalarSubquery(CTERelationRef to the merged plan),
* merged output index 1),
* GetStructField(
* ScalarSubquery(CTERelationRef to the merged plan),
* merged output index 2),
* ...),
* OneRowRelation)
* ```
* where `merged output index`s are the index of the output attributes (of
the CTE) that
* correspond to the output of the original node.
* - If there are merged subqueries in `PlanMerger`s then a `WithCTE` node
is built from these
* queries. The `CTERelationDef` nodes contain the merged subplans in the
following form:
* `Project(Seq(CreateNamedStruct(name1, attribute1, ...) AS mergedValue),
mergedSubplan)`.
* The definitions are flagged that they host a subplan, that can return
maximum one row.
*
* Here are a few examples:
*
* 1. a query with 2 subqueries:
* ```
* Project [scalar-subquery [] AS scalarsubquery(), scalar-subquery [] AS
scalarsubquery()]
* : :- Aggregate [min(a) AS min(a)]
* : : +- Relation [a, b, c]
* : +- Aggregate [sum(b) AS sum(b)]
* : +- Relation [a, b, c]
* +- OneRowRelation
* ```
* is optimized to:
* ```
* WithCTE
* :- CTERelationDef 0
* : +- Project [named_struct(min(a), min(a), sum(b), sum(b)) AS
mergedValue]
* : +- Aggregate [min(a) AS min(a), sum(b) AS sum(b)]
* : +- Relation [a, b, c]
* +- Project [scalar-subquery [].min(a) AS scalarsubquery(),
* scalar-subquery [].sum(b) AS scalarsubquery()]
* : :- CTERelationRef 0
* : +- CTERelationRef 0
* +- OneRowRelation
* ```
*
* 2. a query with 2 non-grouping aggregates:
* ```
* Join Inner
* :- Aggregate [min(a) AS min(a)]
* : +- Relation [a, b, c]
* +- Aggregate [sum(b) AS sum(b), avg(cast(c as double)) AS avg(c)]
* +- Relation [a, b, c]
* ```
* is optimized to:
* ```
* WithCTE
* :- CTERelationDef 0
* : +- Project [named_struct(min(a), min(a), sum(b), sum(b), avg(c),
avg(c)) AS mergedValue]
* : +- Aggregate [min(a) AS min(a), sum(b) AS sum(b), avg(cast(c as
double)) AS avg(c)]
* : +- Relation [a, b, c]
* +- Join Inner
* :- Project [scalar-subquery [].min(a) AS min(a)]
* : : +- CTERelationRef 0
* : +- OneRowRelation
* +- Project [scalar-subquery [].sum(b) AS sum(b), scalar-subquery
[].avg(c) AS avg(c)]
* : :- CTERelationRef 0
* : +- CTERelationRef 0
* +- OneRowRelation
* ```
*
* 3. a query with a subquery and a non-grouping aggregate:
* ```
* Join Inner
* :- Project [scalar-subquery [] AS scalarsubquery()]
* : : +- Aggregate [min(a) AS min(a)]
* : : +- Relation [a, b, c]
* : +- OneRowRelation
* +- Aggregate [sum(b) AS sum(b), avg(cast(c as double)) AS avg(c)]
* +- Relation [a, b, c]
* ```
* is optimized to:
* ```
* WithCTE
* :- CTERelationDef 0
* : +- Project [named_struct(min(a), min(a), sum(b), sum(b), avg(c),
avg(c)) AS mergedValue]
* : +- Aggregate [min(a) AS min(a), sum(b) AS sum(b), avg(cast(c as
double)) AS avg(c)]
* : +- Relation [a, b, c]
* +- Join Inner
* :- Project [scalar-subquery [].min(a) AS scalarsubquery()]
* : : +- CTERelationRef 0
* : +- OneRowRelation
* +- Project [scalar-subquery [].sum(b) AS sum(b), scalar-subquery
[].avg(c) AS avg(c)]
* : :- CTERelationRef 0
* : +- CTERelationRef 0
* +- OneRowRelation
* ```
*/
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]