jchen5 commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1171240575
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,22 @@ object PullupCorrelatedPredicates extends
Rule[LogicalPlan] with PredicateHelper
}
plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
- case ScalarSubquery(sub, children, exprId, conditions, hint) if
children.nonEmpty =>
+ case ScalarSubquery(sub, children, exprId, conditions, hint,
mayHaveCountBugOld)
+ if children.nonEmpty =>
val (newPlan, newCond) = decorrelate(sub, plan)
- ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond,
conditions), hint)
+ val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) {
+ // Check whether the pre-rewrite subquery had empty
groupingExpressions. If yes, it may
+ // be subject to the COUNT bug. If it has non-empty
groupingExpressions, there is
+ // no COUNT bug.
+ val (topPart, havingNode, aggNode) = splitSubquery(sub)
Review Comment:
Yes, I tried it but the problem is it introduces an extra DomainJoin (i.e.
an extra left outer join with an extra copy of the outer table), so it has
worse performance - in the future I think it's possible to get rid of that
extra DomainJoin by making the logic smarter but it's nontrivial. Also, I
wanted to make this more targeted fix first to reduce risk.
I linked this jira to a jira for unifying the count bug handling
https://issues.apache.org/jira/browse/SPARK-36113.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]