Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21184#discussion_r201961786
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
    @@ -284,6 +288,80 @@ class Analyzer(
         }
       }
     
    +  /**
    +   * Replaces [[Alias]] with the same exprId but different references with 
[[Alias]] having
    +   * different exprIds. This is a rare situation which can cause incorrect 
results.
    +   */
    +  object DeduplicateAliases extends Rule[LogicalPlan] {
    --- End diff --
    
    The main problem causing the added UT to fail is that 
`FoldablePropagation` replaces all foldable aliases it considers identical. If 
the same alias, with the same exprId, appears in different parts of the plan 
(actually referencing different things, despite sharing the same id), a wrong 
replacement can happen. In the added UT, the plan is:
    ```
    == Analyzed Logical Plan ==
    a: int, b: int, n: bigint
    Union
    :- Project [a#5, b#17, n#19L]
    :  +- Project [a#5, b#17, n#19L, n#19L]
    :     +- Window [count(1) 
windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), 
unboundedfollowing$())) AS n#19L]
    :        +- Project [a#5, b#6 AS b#17]
    :           +- Project [_1#2 AS a#5, _2#3 AS b#6]
    :              +- LocalRelation [_1#2, _2#3]
    +- Project [a#12, b#17, n#19L]
       +- Project [a#12, b#17, n#19L, n#19L]
          +- Window [count(1) 
windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), 
unboundedfollowing$())) AS n#19L]
             +- Project [a#12, b#14 AS b#17]
                +- Project [a#12, 0 AS b#14]
                   +- Project [value#10 AS a#12]
                      +- LocalRelation [value#10]
    ```
    Please note that both branches of the union contain the same `b#17` 
attribute, but they reference different things. Since the lower one is a 
foldable value which evaluates to 0, every `b#17` is replaced with the literal 
0, causing a wrong result.
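    To illustrate the failure mode, here is a minimal self-contained sketch (these are not Spark's actual classes, just a toy model) of why keying a foldable substitution by exprId alone goes wrong when two different expressions share an id:
    ```scala
    // Toy model of an attribute reference: name plus exprId.
    case class Attr(name: String, exprId: Int)

    // Two union branches both output an attribute b#17, but it is computed
    // differently in each: from a real column in branch 1, from literal 0 in
    // branch 2 (the foldable one).
    val branch1Output = Seq(Attr("a", 5), Attr("b", 17))   // b#17 <- b#6 (real data)
    val branch2Output = Seq(Attr("a", 12), Attr("b", 17))  // b#17 <- 0 (foldable)

    // A FoldablePropagation-style substitution keyed by exprId only:
    // "b#17 is always 0" -- true only for branch 2.
    val foldables: Map[Int, Int] = Map(17 -> 0)

    // Replace an attribute with its foldable literal when the id matches.
    def propagate(output: Seq[Attr]): Seq[Either[Attr, Int]] =
      output.map(a => foldables.get(a.exprId).toRight(a))

    // The literal 0 leaks into BOTH branches, silently overwriting the real
    // data in branch 1.
    assert(propagate(branch1Output).contains(Right(0)))
    ```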
    
    Although we could fix this specific problem in the related Optimizer rule, 
I think we generally assume that items with the same id are the same. So I 
proposed this solution in order to fix all the issues which may arise from this 
unexpected situation.
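    The idea behind the proposed rule can be sketched as follows (a hypothetical simplification, not the actual PR code): when the same exprId is seen bound to a different underlying expression, re-key the later occurrence with a fresh id so downstream rules can safely assume id equality means identity:
    ```scala
    // Toy model of an alias: name, exprId, and the expression it aliases.
    case class Alias(name: String, exprId: Int, child: String)

    // Walk the aliases; the first binding of an exprId wins, and any later
    // alias reusing that id for a DIFFERENT child expression gets a fresh id.
    def deduplicate(aliases: Seq[Alias]): Seq[Alias] = {
      var nextId = aliases.map(_.exprId).max + 1
      val seen = scala.collection.mutable.Map.empty[Int, String]
      aliases.map { a =>
        seen.get(a.exprId) match {
          case Some(child) if child != a.child => // same id, different expression
            nextId += 1
            a.copy(exprId = nextId - 1)           // assign a fresh exprId
          case _ =>
            seen(a.exprId) = a.child
            a
        }
      }
    }
    ```
    After this pass, exprIds are unique per underlying expression, so a rule like `FoldablePropagation` can no longer confuse the two `b#17` attributes of the UT above.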

