peter-toth commented on PR #42223: URL: https://github.com/apache/spark/pull/42223#issuecomment-1665315777
> @peter-toth I agree that the extra project can help if we decide to merge. However, the plan pattern becomes complicated. Without the extra project, the merged aggregate is still `Aggregate -> Filter -> Scan`. We can just define a rule for merging two aggregates, and it can incrementally merge all joined aggregates. With the extra project, we need to define how to merge `Aggregate -> Filter -> Project -> Scan` + `Aggregate -> Filter -> Project -> Scan`, or `Aggregate -> Filter -> Project -> Scan` + `Aggregate -> Filter -> Scan`.

Those patterns are already covered in the current code (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/MergeScalarSubqueries.scala#L235-L255) and are not touched in my aggregate merge PR (https://github.com/apache/spark/pull/37630/files#diff-3d3aa853c51c01216a7f7307219544a48e8c14fabe1817850e58739208bc406aR292-R318).

> Also note that `WHERE cond1 OR cond2` can be better than `FROM (SELECT cond1 AS b_col1, cond2 AS b_col2) WHERE b_col1 OR b_col2`, because `OR` short-circuits and so `cond2` is not always evaluated for each input row. Common subexpression elimination might be a better approach here, if we can make it cross the operator boundary (or is that something only whole-stage codegen can do?).

I'm not sure `WHERE cond1 OR cond2` is better, because up in the aggregate `cond2` is still evaluated for all rows; see the sketches below.
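For context, a minimal sketch of the starting shape under discussion: two aggregates over the same scan but with different filters, written here as scalar subqueries. The table `t`, columns `a` and `b`, and the `sum` aggregates are hypothetical, and `cond1`/`cond2` stand for arbitrary boolean predicates over `t`'s columns:

```sql
-- Two separate Aggregate -> Filter -> Scan plans over the same table;
-- merging them means scanning `t` only once.
SELECT
  (SELECT sum(a) FROM t WHERE cond1) AS agg1,
  (SELECT sum(b) FROM t WHERE cond2) AS agg2;
```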

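And a hedged sketch of the two merged shapes compared in the last point, reusing the same hypothetical names; only the `b_col1`/`b_col2` projection form is taken from the quote above, and the actual plans produced by the PRs may differ:

```sql
-- Shape A (extra Project): cond1/cond2 are evaluated once per input row
-- and materialized as boolean columns that both the WHERE clause and the
-- aggregate FILTERs reuse.
SELECT
  sum(a) FILTER (WHERE b_col1) AS agg1,
  sum(b) FILTER (WHERE b_col2) AS agg2
FROM (SELECT *, cond1 AS b_col1, cond2 AS b_col2 FROM t) AS m
WHERE b_col1 OR b_col2;

-- Shape B (no extra Project): `OR` can short-circuit in the WHERE clause,
-- but each aggregate FILTER re-evaluates its condition for every row that
-- survives the filter, which is the point made above.
SELECT
  sum(a) FILTER (WHERE cond1) AS agg1,
  sum(b) FILTER (WHERE cond2) AS agg2
FROM t
WHERE cond1 OR cond2;
```

Whether Shape A's evaluate-once projection beats Shape B's short-circuiting presumably depends on how selective `cond1` is and how expensive `cond2` is to evaluate.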