Github user nsyca commented on a diff in the pull request: https://github.com/apache/spark/pull/16046#discussion_r90152623 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1120,47 +1173,54 @@ class Analyzer( } else { a } - case w : Window => - failOnOuterReference(w) - failOnNonEqualCorrelatedPredicate(foundNonEqualCorrelatedPred, w) - w - case j @ Join(left, _, RightOuter, _) => - failOnOuterReference(j) - failOnOuterReferenceInSubTree(left, "a RIGHT OUTER JOIN") - j - // SPARK-18578: Do not allow any correlated predicate - // in a Full (Outer) Join operator and its descendants - case j @ Join(_, _, FullOuter, _) => - failOnOuterReferenceInSubTree(j, "a FULL OUTER JOIN") - j - case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => - failOnOuterReference(j) - failOnOuterReferenceInSubTree(right, "a LEFT (OUTER) JOIN") + + // Join can host correlated expressions. + case j @ Join(left, right, joinType, _) => + joinType match { + // Inner join, like Filter, can be anywhere. + // LeftSemi is a special case of Inner join which returns + // only the first matched row to the right table. + case _: InnerLike | LeftSemi => --- End diff -- The reason I write this lengthy response here is to convince you that we should leave `LeftSemi` in the same group as `InnerJoin`. Please bear with me. The example you gave here demonstrates a limitation of the subquery supported in Spark today. We should plan to be able to handle this case of deep correlation in the future: ````sql select * from t where exists (select 1 from t2 where t2.c1=t1.c1 and exists (select 1 from t3 where t3.c2=t1.c1)) ```` And if we do, then we will need to allow LeftSemi to output the columns from the right table. One way to imagine a use case of LeftSemi is if we have a look up join where the join predicate forms a N:1 relationship, just like between a foreign key and its primary key. The join is effectively a LeftSemi that is guaranteed we only need to find the first matched row and move on to the next row of the left table (just like a hash join that we need to probe the first matched and stop early seeking the next matched in the hash chain). From a run-time viewpoint, a LeftSemi is (almost, more on this later) better than a regular InnerJoin that it does not need to probe for the next matched rows regardless of the chosen join methods: nested-loop join, sort-merge join, or hash join. A LeftSemi, however, dictates which tables can be the left and the right. As the name implies, the left table of a LeftSemi needs to be the N-side of the N:1 join. So in the case that N-side is the smaller table, it could be better to do the regular inner join with 1-side as the left table then perform a compensation on top of the join to remove the duplicate matched rows. Having said that, we can also implement a RightSemi join in the runtime layer so that we can pick any join, LeftSemi, RightSemi, or InnerJoin with a compensation, based on the cost. Until we wade into the CBO, the whole planning business would be an interesting area, isn't it?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org