Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/16337
  
    Second installment: subquery/in-subquery/simple-in.sql
    
    **Recommended action:**
    Comment out TC 01.08 and 01.10
    
    **The result checking**
    - All except 2 test cases produce different output.
    - In both TC 01.08 and 10 (query 14 and 16), there are NOT IN subqueries
      that reference outer columns containing null data, namely t1i in 01.08, 
and t2b in 01.10.
      These two cases reveal a problem related to the nullAware LeftAnti 
operator which is 
      rewritten from a NOT IN subquery.
    
    Instead of walking through the original complex test cases, I can 
demonstrate the problem
    with a simpler script:
    
    ````
    Seq((1, 2)).toDF("a1", "b1").createOrReplaceTempView("t1")
    Seq[(java.lang.Integer, java.lang.Integer)]((1, null)).toDF("a2", 
"b2").createOrReplaceTempView("t2")
    
    // The expected result is 1 row of (1,2) as shown in the next statement.
    sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 
b1)").show
    +---+---+
    | a1| b1|
    +---+---+
    +---+---+
    
    sql("select * from t1 where a1 not in (select a2 from t2 where b2 = 
2)").show
    +---+---+
    | a1| b1|
    +---+---+
    |  1|  2|
    +---+---+
    
    == Analyzed Logical Plan ==
    a1: int, b1: int
    Project [a1#684, b1#685]
    +- Filter NOT predicate-subquery#717 [(a1#684 = a2#695) && (b2#696 = 
b1#685)]
       :  +- Project [a2#695, b2#696]
       :     +- SubqueryAlias t2, `t2`
       :        +- Project [_1#692 AS a2#695, _2#693 AS b2#696]
       :           +- LocalRelation [_1#692, _2#693]
       +- SubqueryAlias t1, `t1`
          +- Project [_1#681 AS a1#684, _2#682 AS b1#685]
             +- LocalRelation [_1#681, _2#682]
    
    == Optimized Logical Plan ==
    Project [_1#681 AS a1#684, _2#682 AS b1#685]
    +- Join LeftAnti, ((isnull((_1#681 = a2#695)) || isnull((b2#696 = _2#682))) 
|| ((_1#681 = a2#695) && (b2#696 = _2#682)))
       :- LocalRelation [_1#681, _2#682]
       +- LocalRelation [a2#695, b2#696]
    ````
    
    When the comparison between b1 and b2 is unknown, the subquery returns no 
row hence the result is all the rows from the parent. The problem in the plan 
above is the expression 
    `isnull((b2#696 = _2#682))` is evaluated to `isnull( unknown )` which is 
true — that results in the entire expression of the `LeftAnti` is true. By 
the definition of `LeftAnti`, the rows are matched and should be filtered. I 
have opened SPARK-18966 to track this problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to