[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

nsyca Wed, 31 Aug 2016 09:11:06 -0700

Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/14899
  
    Some background on my investigation of the problem:
    
    How can we do name resolution in an SQL statement?
    
    ```
    scala> sql("select * from t1, t1 t2").explain(true)
    == Parsed Logical Plan ==
    'Project [*]
    +- 'Join Inner
       :- 'UnresolvedRelation `t1`
       +- 'UnresolvedRelation `t1`, t2
    
    == Analyzed Logical Plan ==
    c1: int, c1: int
    Project [c1#3, c1#147]
    +- Join Inner
       :- SubqueryAlias t1
       :  +- Project [value#1 AS c1#3]
       :     +- LocalRelation [value#1]
       +- SubqueryAlias t2
          +- SubqueryAlias t1
             +- Project [value#1 AS c1#147]
                +- LocalRelation [value#1]
    ```
    
    Here the second reference to the column c1 is changed to c1#147 by the 
private method `dedupRight`  in object `ResolveReferences` in Analyzer.scala. 
The problem is `dedupRight` is only called in 3 contexts, Join, Intersect, and 
Except.
    
    ```
          // To resolve duplicate expression IDs for Join and Intersect
          case j @ Join(left, right, _, _) if !j.duplicateResolved =>
            j.copy(right = dedupRight(left, right))
          case i @ Intersect(left, right) if !i.duplicateResolved =>
            i.copy(right = dedupRight(left, right))
          case i @ Except(left, right) if !i.duplicateResolved =>
            i.copy(right = dedupRight(left, right))
    ```
    
    The assumption of this name resolution algorithm is any name conflicts 
outside of the 3 aforementioned contexts are okay. Those name conflicts will 
never mix with one another to create any ambiguity. This is certainly not the 
case in the example below.
    
    ```
    Seq(1,2).toDF("c1").createOrReplaceTempView("t1")
    Seq(1).toDF("c2").createOrReplaceTempView("t2")
    
    scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on 
t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").show
    
    +----+
    |  c3|
    +----+
    |   2|
    |null|
    +----+
    ```
    
    The correct answer is 1 row of (2). From the plan below, the incorrect 
portion of the plan is the LeftAnti, rewritten from the NOT IN subquery, is 
pushed down below the (T1 LOJ T2) operation. Because LeftAnti predicate is 
evaluated to unknown (or null) if any argument of a comparison operator in the 
predicate is null, e.g. NULL = <value> is evaluated to unknown (which is 
equivalent to false in the context of a predicate), the LeftAnti predicate 
cannot be pushed down into a LOJ operation.
    
    The logical plan looks like:
    
    ```
    scala> sql("select * from (select t2.c2+1 as c3 from t1 left join t2 on 
t1.c1=t2.c2) t3 where c3 not in (select c2 from t2)").explain(true)
    == Parsed Logical Plan ==
    'Project [*]
    +- 'Filter NOT 'c3 IN (list#124)
       :  +- 'SubqueryAlias list#124
       :     +- 'Project ['c2]
       :        +- 'UnresolvedRelation `t2`
       +- 'SubqueryAlias t3
          +- 'Project [('t2.c2 + 1) AS c3#123]
             +- 'Join LeftOuter, ('t1.c1 = 't2.c2)
                :- 'UnresolvedRelation `t1`
                +- 'UnresolvedRelation `t2`
    
    == Analyzed Logical Plan ==
    c3: int
    Project [c3#123]
    +- Filter NOT predicate-subquery#124 [(c3#123 = c2#77)]
       :  +- SubqueryAlias predicate-subquery#124 [(c3#123 = c2#77)]
       :     +- Project [c2#77]
       :        +- SubqueryAlias t2
       :           +- Project [value#75 AS c2#77]
       :              +- LocalRelation [value#75]
       +- SubqueryAlias t3
          +- Project [(c2#77 + 1) AS c3#123]
             +- Join LeftOuter, (c1#3 = c2#77)
                :- SubqueryAlias t1
                :  +- Project [value#1 AS c1#3]
                :     +- LocalRelation [value#1]
                +- SubqueryAlias t2
                   +- Project [value#75 AS c2#77]
                      +- LocalRelation [value#75]
    ...
    ```
    The culprit is the column `c2#77` from the two references of `SubqueryAlias 
t2` are from two different contexts. They cannot share the same name. The 
problem is subtle at this point from the derived column `c3#123` of the 
expression `(c2#77 + 1)` but it becomes obvious when looking at the Optimized 
Logical Plan (below):
    
    ```
    == Optimized Logical Plan ==
    Project [(c2#77 + 1) AS c3#123]
    +- Join LeftOuter, (c1#3 = c2#77)
       :- Project [value#1 AS c1#3]
       :  +- Join LeftAnti, (isnull(((c2#77 + 1) = c2#77)) || ((c2#77 + 1) = 
c2#77))
       :     :- LocalRelation [value#1]
       :     +- LocalRelation [c2#77]
       +- LocalRelation [c2#77]
    ```
    
    Given more and more rewrite rules can be/will be added in the future, it is 
impossible to ensure that any name conflicts in any part of the same SQL 
statement will not collide as the logical plan gets rewritten. I think a solid 
solution is to generate a unique name for any column from any base relation.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #14899: [SPARK-17337][SQL] Incomplete algorithm for name resolut...

Reply via email to