Github user nsyca commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16046#discussion_r90152623
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
    @@ -1120,47 +1173,54 @@ class Analyzer(
               } else {
                 a
               }
    -        case w : Window =>
    -          failOnOuterReference(w)
    -          failOnNonEqualCorrelatedPredicate(foundNonEqualCorrelatedPred, w)
    -          w
    -        case j @ Join(left, _, RightOuter, _) =>
    -          failOnOuterReference(j)
    -          failOnOuterReferenceInSubTree(left, "a RIGHT OUTER JOIN")
    -          j
    -        // SPARK-18578: Do not allow any correlated predicate
    -        // in a Full (Outer) Join operator and its descendants
    -        case j @ Join(_, _, FullOuter, _) =>
    -          failOnOuterReferenceInSubTree(j, "a FULL OUTER JOIN")
    -          j
    -        case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] =>
    -          failOnOuterReference(j)
    -          failOnOuterReferenceInSubTree(right, "a LEFT (OUTER) JOIN")
    +
    +        // Join can host correlated expressions.
    +        case j @ Join(left, right, joinType, _) =>
    +          joinType match {
    +            // Inner join, like Filter, can be anywhere.
    +            // LeftSemi is a special case of Inner join which returns
    +            // only the first matched row to the right table.
    +            case _: InnerLike | LeftSemi =>
    --- End diff --
    
    The reason I am writing this lengthy response here is to convince you that 
we should leave `LeftSemi` in the same group as `InnerJoin`. Please bear with me.
    
    The example you gave here demonstrates a limitation of subquery support in 
Spark today. We should plan to handle this case of deep correlation in the 
future:
    
    ````sql
    select  *
    from    t1
    where   exists (select 1
                    from   t2
                    where  t2.c1=t1.c1
                    and    exists (select 1
                                   from   t3
                                   where  t3.c2=t1.c1))
    ````
    
    And if we do, we will need to allow `LeftSemi` to output columns from the 
right table.
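    To make that concrete, here is a small sketch of the query's semantics on 
plain Scala collections (the tables and data are made up for illustration, and 
this is only a model of the rewrite, not Catalyst's actual decorrelation). 
`naive` evaluates the nested `EXISTS` directly; `decorrelated` shows why the 
semi join between `t2` and `t3` has to let a right-table column (`t3.c2`) 
survive, so the deeply correlated predicate `t3.c2 = t1.c1` can be applied 
above it:

````scala
// Toy relations mirroring t1, t2, t3 from the SQL above (data made up).
case class T1(c1: Int)
case class T2(c1: Int)
case class T3(c2: Int)

object DeepCorrelation {
  val t1 = Seq(T1(1), T1(2), T1(3))
  val t2 = Seq(T2(1), T2(2))
  val t3 = Seq(T3(2), T3(3))

  // Direct evaluation: both EXISTS predicates correlate to the outer t1.c1.
  def naive: Seq[T1] =
    t1.filter(r => t2.exists(_.c1 == r.c1) && t3.exists(_.c2 == r.c1))

  // A decorrelated shape, built bottom-up: t2 and t3 share no local join
  // predicate (both only correlate to t1), so the t2/t3 join must output
  // (t2.c1, t3.c2) pairs -- a semi-join-like operator that keeps a column
  // from its right table -- and t1 semi-joins against those pairs on top.
  def decorrelated: Seq[T1] = {
    val pairs: Seq[(Int, Int)] = for (l <- t2; r <- t3) yield (l.c1, r.c2)
    t1.filter(row => pairs.exists { case (c1, c2) => c1 == row.c1 && c2 == row.c1 })
  }
}
````

Both shapes return the same rows; the point is that the rewrite is impossible 
if the intermediate semi join may only emit left-table columns.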
    
    One way to imagine a use case for LeftSemi is a lookup join where the join 
predicate forms an N:1 relationship, just like between a foreign key and its 
primary key. The join is effectively a LeftSemi: we are guaranteed to need only 
the first matched row before moving on to the next row of the left table (just 
like a hash join that can stop probing the hash chain as soon as the first 
match is found). From a run-time viewpoint, a LeftSemi is (almost, more on this 
later) better than a regular InnerJoin in that it does not need to probe for 
further matched rows, regardless of the chosen join method: nested-loop join, 
sort-merge join, or hash join.
    
    A LeftSemi, however, dictates which table is the left and which is the 
right. As the name implies, the left table of a LeftSemi needs to be the N-side 
of the N:1 join. So when the N-side is the smaller table, it could be better to 
do a regular inner join with the 1-side as the left table and then perform a 
compensation on top of the join to remove the duplicate matched rows. Having 
said that, we could also implement a RightSemi join in the runtime layer so 
that we could pick any of LeftSemi, RightSemi, or an InnerJoin with a 
compensation, based on cost.
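    As a tiny illustration of that compensation (made-up rows on plain Scala 
collections, not Spark's runtime): a plain inner join emits one output row per 
matching pair, so when the right side has duplicate join keys, a dedup on top 
restores LeftSemi semantics:

````scala
case class Row(k: Int, v: String)

object SemiVsInner {
  val left  = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))
  val right = Seq(1, 1, 3) // key 1 appears twice on the right side

  // LEFT SEMI JOIN: each left row is emitted at most once; probing can stop
  // at the first match.
  val semi: Seq[Row] = left.filter(r => right.contains(r.k))

  // Inner join: one output per matching pair, so Row(1, "a") comes out twice.
  val inner: Seq[Row] = for (r <- left; k <- right if r.k == k) yield r

  // The compensation: remove the duplicate matched rows. (This assumes the
  // left rows are themselves distinct; a real rewrite would deduplicate by
  // the left row's identity, e.g. a key or row id.)
  val innerPlusCompensation: Seq[Row] = inner.distinct
}
````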
    
    Once we wade into the CBO, the whole planning business will be an 
interesting area, won't it?
