peter-toth commented on PR #42223:
URL: https://github.com/apache/spark/pull/42223#issuecomment-1665315777

   > @peter-toth I agree that the extra project can help if we decide to merge. However, the plan pattern becomes complicated. Without the extra project, the merged aggregate is still `Aggregate -> Filter -> Scan`. We can just define a rule for merging two aggregates, and it can incrementally merge all joined aggregates. With the extra project, we need to define how to merge `Aggregate -> Filter -> Project -> Scan` + `Aggregate -> Filter -> Project -> Scan`, or `Aggregate -> Filter -> Project -> Scan` + `Aggregate -> Filter -> Scan`.
   
   Those patterns are already covered in the current code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala#L235-L255 and are not touched by my aggregate merge PR: https://github.com/apache/spark/pull/37630/files#diff-3d3aa853c51c01216a7f7307219544a48e8c14fabe1817850e58739208bc406aR292-R318
   
   > Also note that `WHERE cond1 OR cond2` can be better than `FROM (SELECT cond1 AS b_col1, cond2 AS b_col2) WHERE b_col1 OR b_col2`, because `OR` short-circuits and `cond2` is not always evaluated for each input row. Common subexpression elimination might be a better approach here, if we can make it cross the operator boundary (or is whole-stage codegen the only place that can do it?).
   
   I'm not sure `WHERE cond1 OR cond2` is better, because up in the aggregate `cond2` still has to be evaluated for all rows.
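   To make that concrete, here is a hand-written approximation of the merged form (all column and alias names are hypothetical; this is a sketch of the shape, not the rule's exact output). Even though the filter is `b_col1 OR b_col2`, each aggregate consumes the per-row condition values, so `cond2` is computed for every surviving row regardless of the `OR` short-circuit:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the merged form: both filter conditions are projected as boolean
// columns so the shared aggregate can tell which original subquery each row
// belongs to. Names and schema are hypothetical.
object MergedFormSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merged-form").getOrCreate()
    spark.range(100).selectExpr("id", "id % 10 AS key").createOrReplaceTempView("t")
    spark.sql(
      """SELECT
        |  avg(CASE WHEN b_col1 THEN id END) AS a,
        |  sum(CASE WHEN b_col2 THEN id END) AS b
        |FROM (SELECT id, key = 1 AS b_col1, key = 2 AS b_col2 FROM t)
        |WHERE b_col1 OR b_col2
        |""".stripMargin).explain(true)
    spark.stop()
  }
}
```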

