c21 commented on a change in pull request #31630:
URL: https://github.com/apache/spark/pull/31630#discussion_r581774000
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -555,6 +559,8 @@ object LimitPushDown extends Rule[LogicalPlan] {
join.copy(
left = maybePushLocalLimit(exp, left),
right = maybePushLocalLimit(exp, right))
+ case LeftSemi | LeftAnti if conditionOpt.isEmpty =>
+ join.copy(left = maybePushLocalLimit(exp, left))
Review comment:
@maropu - this actually reminds me that we can optimize further at runtime. I already did this for LEFT SEMI joins with AQE in
https://github.com/apache/spark/pull/29484 . Similarly, a LEFT ANTI join
without a condition can be converted to an empty relation during runtime
if the right (build) side is non-empty. Will submit a follow-up PR tomorrow.
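To see why that runtime rewrite is safe, here is a minimal sketch in plain Scala (a toy model with hypothetical names, not Spark's actual AQE rule or operators) of the semantics of a LEFT ANTI join with no condition:

```scala
// Toy model (assumed names, not Spark classes) of a LEFT ANTI join
// with no join condition.
object NoConditionLeftAnti {
  // With no condition, a left row survives iff the right side is empty,
  // so the result is either all of `left` or nothing.
  def leftAnti[L, R](left: Seq[L], right: Seq[R]): Seq[L] =
    if (right.isEmpty) left else Seq.empty

  // The rewrite AQE could apply: once runtime stats show `right` is
  // non-empty, the join collapses to an empty relation without
  // reading `left` at all.
  def canConvertToEmptyRelation[R](right: Seq[R]): Boolean = right.nonEmpty
}
```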
In addition, after taking a closer look at `BroadcastNestedLoopJoinExec`
(I had never looked at it closely because it is not popular in our environment),
I found several places we can optimize:
* populate `outputOrdering` and `outputPartitioning` when possible, to avoid
a shuffle/sort in a later stage.
* add a shortcut for `LEFT SEMI/ANTI` in `defaultJoin()`, since we don't need
to scan all rows when there is no join condition.
* code-gen the operator.
I will file an umbrella JIRA with minor priority and do it gradually.
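The `defaultJoin()` shortcut above can be sketched in plain Scala (hypothetical helper names, not `BroadcastNestedLoopJoinExec` itself): with no join condition, the per-row scan of the broadcast side collapses to a single emptiness check.

```scala
// Sketch of the LEFT SEMI shortcut (assumed names, not Spark code).
object SemiAntiShortcut {
  // Naive nested-loop LEFT SEMI: O(|stream| * |build|) condition checks.
  def leftSemiNaive[L, R](stream: Seq[L], build: Seq[R])
                         (cond: (L, R) => Boolean): Seq[L] =
    stream.filter(l => build.exists(r => cond(l, r)))

  // With no condition, every streamed row matches iff the build side is
  // non-empty, so no row-by-row scan is needed at all.
  def leftSemiNoCondition[L, R](stream: Seq[L], build: Seq[R]): Seq[L] =
    if (build.nonEmpty) stream else Seq.empty
}
```

The LEFT ANTI case is symmetric: keep the streamed rows only when the build side is empty.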
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]