agubichev commented on PR #48145: URL: https://github.com/apache/spark/pull/48145#issuecomment-2361679216
> Do all the existing optimizer rules work fine with this single join? I understand that we need to implement the single-match check in all the physical join nodes, but semantic wise, is there anything we need to take care?

@cloud-fan I've traced all the usages of LeftOuter in the catalyst rules (see the full list below). In general, the rules work on an allow-list basis: if a rule does not explicitly match a join type, it is not applied to that join. Since LeftOuter is a close relative of LeftSingle (in fact, at HEAD we are using LeftOuter in place of LeftSingle), it is enough to check the rules that already reference LeftOuter explicitly.

LeftOuter joins are already very restrictive in which optimizations can be applied to them (and they frequently restrict optimizations in the surrounding plan nodes too), so I am not aware of many join-type-agnostic rules. The ones that I do know of, like ReplaceNullWithFalseInPredicate, apply to both LeftOuter and LeftSingle without change.

These rules have been updated for LeftSingle joins:
- EliminateOuterJoin: should not apply to LeftSingle, updated
- PushPredicateThroughJoin: not all cases should apply to LeftSingle, updated
- FoldablePropagation

The following rules match only LeftOuter joins for now, so they leave LeftSingle joins unchanged. Semantically, it is safe to skip every one of these rules for LeftSingle joins. Further analysis is needed on whether we can or should enable them for LeftSingle joins:
- InferFiltersFromConstraints
- LimitPushDown
- PropagateEmptyRelation
- PushLeftSemiLeftAntiThroughJoin
- PushExtraPredicateThroughJoin

There are a couple of rules that apply to LeftOuter but do not make sense for LeftSingle. In both cases LeftOuter is matched explicitly, so they skip LeftSingle, as they should:
- CheckCartesianProducts
- RewriteAsOfJoin
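
For readers less familiar with how an allow-list rule behaves, the following is a minimal Scala sketch of the idea, not Spark's actual code. `AllowListSketch`, `Join`, `appliedRule`, `leftOuterOnlyRule`, and `extendedRule` are hypothetical names introduced purely for illustration; the join types are toy stand-ins rather than Catalyst's real classes. It shows why a rule that names only LeftOuter leaves a LeftSingle join untouched until someone lists LeftSingle explicitly.

```scala
object AllowListSketch {
  // Hypothetical, simplified stand-ins; these are NOT Spark's real classes.
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object LeftSingle extends JoinType

  // A toy plan node: the join type plus a tag recording which rewrite touched it.
  final case class Join(joinType: JoinType, appliedRule: Option[String] = None)

  // A rule that names only LeftOuter. Any other join type, including
  // LeftSingle, falls through to the default case and is returned unchanged.
  def leftOuterOnlyRule(join: Join): Join = join match {
    case Join(LeftOuter, _) => join.copy(appliedRule = Some("leftOuterOnlyRule"))
    case other              => other
  }

  // A rule that has been reviewed and extended: LeftSingle is listed explicitly,
  // so the rewrite now fires for it as well.
  def extendedRule(join: Join): Join = join match {
    case Join(LeftOuter | LeftSingle, _) => join.copy(appliedRule = Some("extendedRule"))
    case other                           => other
  }

  def main(args: Array[String]): Unit = {
    // The first rule skips the LeftSingle join; the second rewrites it.
    println(leftOuterOnlyRule(Join(LeftSingle)))
    println(extendedRule(Join(LeftSingle)))
  }
}
```

Under this model, adding a new join type cannot silently change the behavior of existing rules: each rule has to opt in, which is why auditing the explicit LeftOuter references is a reasonable way to scope the review.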
