[GitHub] [iceberg] singhpk234 commented on a change in pull request #3992: Spark 3.2: Fix cardinality check for alternative join implementations

GitBox Tue, 01 Feb 2022 06:24:18 -0800


singhpk234 commented on a change in pull request #3992:
URL: https://github.com/apache/iceberg/pull/3992#discussion_r796278084




##########
File path: 
spark/v3.2/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteMergeIntoTable.scala
##########
@@ -186,7 +187,7 @@ object RewriteMergeIntoTable extends RewriteRowLevelCommand 
{
     // disable broadcasts for the target table to perform the cardinality check
     val joinType = if (notMatchedActions.isEmpty) LeftOuter else FullOuter
     val joinHint = JoinHint(leftHint = 
Some(HintInfo(Some(NO_BROADCAST_HASH))), rightHint = None)
-    val joinPlan = Join(targetTableProj, sourceTableProj, joinType, 
Some(cond), joinHint)
+    val joinPlan = Join(NoStatsUnaryNode(targetTableProj), sourceTableProj, 
joinType, Some(cond), joinHint)

Review comment:
       [doubt] Hi All, Can adding a new UnaryNode mess with predicatePushDown 
rule ?
    
   At present predicatePushDown rule maintains the list of unary logical 
operator from which it's safe to push down the filter's from. so when a new 
node we introduce since it's not present in the allow-list and hence filter 
will get stuck at the parent of that node ... and will not be pushed through, 
which in turn cause perf regression. 
   
   Not sure if this is the case what's the best possible way to handle it. 
[Spark 3.2 code 
pointer](https://github.com/apache/spark/blob/v3.2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1528-L1550)
   
   ```
   
       case filter @ Filter(_, u: UnaryNode)
           if canPushThrough(u) && u.expressions.forall(_.deterministic) =>
         pushDownPredicate(filter, u.child) { predicate =>
           u.withNewChildren(Seq(Filter(predicate, u.child)))
         }
     }
   
     def canPushThrough(p: UnaryNode): Boolean = p match {
       // Note that some operators (e.g. project, aggregate, union) are being 
handled separately
       // (earlier in this rule).
       case _: AppendColumns => true
       case _: Distinct => true
       case _: Generate => true
       case _: Pivot => true
       case _: RepartitionByExpression => true
       case _: Repartition => true
       case _: ScriptTransformation => true
       case _: Sort => true
       case _: BatchEvalPython => true
       case _: ArrowEvalPython => true
       case _: Expand => true
       case _ => false
     }
   ```
   
   I recently hit something similar (not in iceberg), hence posting here just 
in case it's applicable here as well.
   
   cc @aokolnychyi @rdblue @RussellSpitzer 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on a change in pull request #3992: Spark 3.2: Fix cardinality check for alternative join implementations

Reply via email to