Github user nsyca commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17491#discussion_r109222591
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
    @@ -90,11 +90,12 @@ trait PredicateHelper {
        * Returns true iff `expr` could be evaluated as a condition within join.
        */
       protected def canEvaluateWithinJoin(expr: Expression): Boolean = expr 
match {
    -    case l: ListQuery =>
    +    case _: ListQuery | _: Exists =>
           // A ListQuery defines the query which we want to search in an IN 
subquery expression.
           // Currently the only way to evaluate an IN subquery is to convert 
it to a
           // LeftSemi/LeftAnti/ExistenceJoin by `RewritePredicateSubquery` 
rule.
           // It cannot be evaluated as part of a Join operator.
    +      // An Exists shouldn't be push into a Join operator too.
    --- End diff --
    
    What this code does is around the idea of treating an uncorrelated subquery 
as a black box. The subquery is processed as a self-contained operation and a 
list of values is returned. After that, the code evaluates as if this is an IN 
list predicate like <col> IN (<list-of-literals>). In your code above, <col> is 
represented as a "true" literal. That means the returned values from the 
subquery must be in Boolean type too.
    
    Putting a LIMIT does help to short-circuit the processing to the first row. 
I still think putting a LIMIT explicitly as an extra LogicalPlan operator may 
have some negative side effect in the way that it prevents other Optimizer 
rules to further optimize the query.
    
    I feel this optimization could be done better in the run-time area, rather 
than trying to shoehorn it in the Optimizer phase. What we can do is 1) 
propagate the notion of "early out" deeper to the operator on the RHS of the 
outer join. If it's a scan, stop scanning on the first row. 2) one more step 
further: cache the result of the RHS without a rescan because the next row from 
the parent table will always get the same answer from rescanning the subquery.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to