Github user nsyca commented on the issue:
https://github.com/apache/spark/pull/16337
I will give an update on the result checking against a second database in
two installments.
This first installment covers the file in-group-by.sql.
**Recommended action:**
Comment out TC 01.12 and 01.13
**General comments**
The second database I used to test is DB2 for Linux/Unix/Windows version
11.1. I needed to make some minor modifications to the test files to run them on DB2.
- Data type mappings: String -> Varchar(30), Short -> smallint, Long ->
bigint
- String literals in DB2 use single quotes. All the input strings are
altered to DB2 format.
- The last value of the last row of t1 has an invalid date format. Spark
suppresses the conversion error and replaces the value with NULL.
DB2 does not do this implicit conversion, so I changed the input value
to NULL. (Is the invalid value intentional or an accidental
human error?)
- DB2 does not support count(distinct(*)) (in TC 01.09/query 15), so I
changed it to count(distinct(<concatenation-of-all-columns>)); see the sketch after this list.
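For illustration, here is a minimal sketch of the kind of rewrite I applied. The table, column names, and literal values below are hypothetical, not the actual test schema:
````
-- hypothetical DDL showing the type mappings
-- (Spark side: t1a String, t1b Short, t1c Long)
CREATE TABLE t1 (t1a VARCHAR(30), t1b SMALLINT, t1c BIGINT);

-- string literals rewritten to single quotes
SELECT * FROM t1 WHERE t1a = 'val1a';

-- count(distinct(*)) is rejected by DB2; rewrite it over a
-- concatenation of all columns, casting the numeric ones to character
SELECT count(DISTINCT t1a || CHAR(t1b) || CHAR(t1c)) FROM t1;
````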
**The result checking**
- All except two test cases produce matching output on both databases.
- In TC 01.12 and 01.13 (queries 18 and 19), DB2 reports a syntax error
because the columns T2B (in 01.12) and T2C (in 01.13) are used in an invalid context.
- TC 01.12 seems to be another case of SPARK-18863.
- TC 01.13 is a case to debate here.
A simplified script to reproduce:
````
Seq((1,1,1)).toDF("t1a", "t1b", "t1c").createOrReplaceTempView("t1")
Seq((1,1,1)).toDF("t2a", "t2b", "t2c").createOrReplaceTempView("t2")
Seq((1,1,1)).toDF("t3a", "t3b", "t3c").createOrReplaceTempView("t3")
â TC 01.12
select * from t1 where t1a in
(select min(t2a) from t2 where t2a = t2a and t2c >= 1 group by t2c having
t2c in
(select t3c from t3 group by t3c, t3b having t2b > 6 and t3b > t2b ))
== Parsed Logical Plan ==
'Project [*]
+- 'Filter 't1a IN (list#803)
: +- 'Filter 't2c IN (list#802)
: : +- 'Filter (('t2b > 6) && ('t3b > 't2b))
: : +- 'Aggregate ['t3c, 't3b], ['t3c]
: : +- 'UnresolvedRelation `t3`
: +- 'Aggregate ['t2c], [unresolvedalias('min('t2a), None)]
: +- 'Filter (('t2a = 't2a) && ('t2c >= 1))
: +- 'UnresolvedRelation `t2`
+- 'UnresolvedRelation `t1`
== Analyzed Logical Plan ==
t1a: int, t1b: int, t1c: int
Project [t1a#764, t1b#765, t1c#766]
+- Filter predicate-subquery#803 [(t1a#764 = min(t2a)#816)]
: +- Project [min(t2a)#816]
: +- !Filter predicate-subquery#802 [(t2c#781 = t3c#796) && (t2b#780 > 6) && (t3b#795 > t2b#780)]
: : +- Project [t3c#796, t3b#795]
: : +- Aggregate [t3c#796, t3b#795], [t3c#796, t3b#795]
: : +- SubqueryAlias t3, `t3`
: : +- Project [_1#790 AS t3a#794, _2#791 AS t3b#795, _3#792 AS t3c#796]
: : +- LocalRelation [_1#790, _2#791, _3#792]
: +- Aggregate [t2c#781], [min(t2a#779) AS min(t2a)#816, t2c#781]
: +- Filter ((t2a#779 = t2a#779) && (t2c#781 >= 1))
: +- SubqueryAlias t2, `t2`
: +- Project [_1#775 AS t2a#779, _2#776 AS t2b#780, _3#777 AS t2c#781]
: +- LocalRelation [_1#775, _2#776, _3#777]
+- SubqueryAlias t1, `t1`
+- Project [_1#760 AS t1a#764, _2#761 AS t1b#765, _3#762 AS t1c#766]
+- LocalRelation [_1#760, _2#761, _3#762]
== Optimized Logical Plan ==
Project [_1#760 AS t1a#764, _2#761 AS t1b#765, _3#762 AS t1c#766]
+- Join LeftSemi, (_1#760 = min(t2a)#816)
:- LocalRelation [_1#760, _2#761, _3#762]
+- Aggregate [t2c#781], [min(t2a#779) AS min(t2a)#816]
+- Project [_1#775 AS t2a#779, _3#777 AS t2c#781]
+- Join LeftSemi, (((_3#777 = t3c#796) && (_2#776 > 6)) && (t3b#795 > _2#776))
:- Filter (_3#777 >= 1)
: +- LocalRelation [_1#775, _2#776, _3#777]
+- Aggregate [t3c#796, t3b#795], [t3c#796, t3b#795]
+- LocalRelation [t3b#795, t3c#796]
````
I don't see the column t2b#780 in the output of the (lower)
Aggregate operator. Somehow the LeftSemi join predicate t2b > 6 is simply pushed down
below the Aggregate over t2. This does not look right to me. I will record this
problem in SPARK-18863 for further investigation.
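As a point of reference, here is a variant that both databases should accept. This is my own sketch rather than anything in the test file: making t2b a grouping column of the middle block gives the correlated references a well-defined meaning:
````
-- hypothetical rewrite of TC 01.12: t2b is added to the GROUP BY,
-- so the references to t2b from the inner HAVING clause are legal
select * from t1 where t1a in
  (select min(t2a) from t2 where t2a = t2a and t2c >= 1
   group by t2c, t2b
   having t2c in
     (select t3c from t3 group by t3c, t3b
      having t2b > 6 and t3b > t2b))
````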
````
-- TC 01.13
select * from (select * from t2 where t2a in (select t1a from t1 where t1b = t2b)) t2
where t2a in
  (select t2a from t2 where t2a = t2a and t2c > 1 group by t2a having t2c > 8)
== Parsed Logical Plan ==
'Project [*]
+- 'Filter 't2a IN (list#824)
: +- 'Filter ('t2c > 8)
: +- 'Aggregate ['t2a], ['t2a]
: +- 'Filter (('t2a = 't2a) && ('t2c > 1))
: +- 'UnresolvedRelation `t2`
+- 'SubqueryAlias t2
+- 'Project [*]
+- 'Filter 't2a IN (list#823)
: +- 'Project ['t1a]
: +- 'Filter ('t1b = 't2b)
: +- 'UnresolvedRelation `t1`
+- 'UnresolvedRelation `t2`
== Analyzed Logical Plan ==
t2a: int, t2b: int, t2c: int
Project [t2a#779, t2b#780, t2c#781]
+- Filter predicate-subquery#824 [(t2a#779 = t2a#779#835) && (t2c#781 > 8)]
: +- Project [t2a#779 AS t2a#779#835]
: +- Aggregate [t2a#779], [t2a#779]
: +- Filter ((t2a#779 = t2a#779) && (t2c#781 > 1))
: +- SubqueryAlias t2, `t2`
: +- Project [_1#775 AS t2a#779, _2#776 AS t2b#780, _3#777 AS t2c#781]
: +- LocalRelation [_1#775, _2#776, _3#777]
+- SubqueryAlias t2
+- Project [t2a#779, t2b#780, t2c#781]
+- Filter predicate-subquery#823 [(t2a#779 = t1a#764) && (t1b#765 = t2b#780)]
: +- Project [t1a#764, t1b#765]
: +- SubqueryAlias t1, `t1`
: +- Project [_1#760 AS t1a#764, _2#761 AS t1b#765, _3#762 AS t1c#766]
: +- LocalRelation [_1#760, _2#761, _3#762]
+- SubqueryAlias t2, `t2`
+- Project [_1#775 AS t2a#779, _2#776 AS t2b#780, _3#777 AS t2c#781]
+- LocalRelation [_1#775, _2#776, _3#777]
````
The column t2c in `having t2c > 8` should be resolved in the closest scope, the
t2 inside the subquery, but since t2c is not one of the GROUP BY columns, that is an invalid usage.
Spark then looks for t2c in an outer scope and maps the column to t2c#781 of the
derived table t2 in the parent query block. I don't know how the scoping rules are defined in
Spark, but the way it works could have unintended effects for users.
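Here is a minimal sketch of the hazard, my own example rather than one from the test file. Under the outer-scope resolution, the HAVING predicate silently becomes a correlated filter on the parent query block instead of raising an error:
````
-- t2c is not a grouping column of the inner block, so a strict
-- reading rejects this query; Spark instead binds t2c to the outer
-- t2, so the predicate filters on t2_outer.t2c, which the user may
-- never have intended
select * from t2 t2_outer
where t2a in (select t2a from t2 group by t2a having t2c > 8)
````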