[GitHub] [spark] jchen5 commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

via GitHub Wed, 19 Apr 2023 05:06:21 -0700


jchen5 commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1171240575



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,22 @@ object PullupCorrelatedPredicates extends 
Rule[LogicalPlan] with PredicateHelper
     }
 
     plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
-      case ScalarSubquery(sub, children, exprId, conditions, hint) if 
children.nonEmpty =>
+      case ScalarSubquery(sub, children, exprId, conditions, hint, 
mayHaveCountBugOld)
+        if children.nonEmpty =>
         val (newPlan, newCond) = decorrelate(sub, plan)
-        ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond, 
conditions), hint)
+        val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) {
+          // Check whether the pre-rewrite subquery had empty 
groupingExpressions. If yes, it may
+          // be subject to the COUNT bug. If it has non-empty 
groupingExpressions, there is
+          // no COUNT bug.
+          val (topPart, havingNode, aggNode) = splitSubquery(sub)

Review Comment:
   Yes, I tried it but the problem is it introduces an extra DomainJoin (i.e. 
an extra left outer join with an extra copy of the outer table), so it has 
worse performance - in the future I think it's possible to get rid of that 
extra DomainJoin by making the logic smarter but it's nontrivial. Also, I 
wanted to make this more targeted fix first to reduce risk.
   
   I linked this jira to a jira for unifying the count bug handling 
https://issues.apache.org/jira/browse/SPARK-36113.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] jchen5 commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Reply via email to