chasingegg commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1061512536


   > ### What changes were proposed in this pull request?
   > Fixes a correctness bug in `Union` in the case that there are duplicate 
output columns. Previously, duplicate columns on one side of the union would 
result in a duplicate column being output on the other side of the union.
   > 
   > To do so, we go through the union’s child’s output and find the 
duplicates. For each duplicate set, there is a first duplicate: this one is 
left alone. All following duplicates are aliased and given a tag; this tag is 
used to remove ambiguity during resolution.
   > 
   > As the first duplicate is left alone, the user can still select it, 
avoiding a breaking change. As the later duplicates are given new expression 
IDs, this fixes the correctness bug.
   > 
   > ### Why are the changes needed?
   > Output of union with duplicate columns in the children was incorrect
   > 
   > ### Does this PR introduce _any_ user-facing change?
   > Example query:
   > 
   > ```
   > SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   > UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d)
   > ```
   > 
   > Result before:
   > 
   > ```
   > a | a
   > _ | _
   > 1 | 1
   > 1 | 1
   > 2 | 2
   > 2 | 2
   > ```
   > 
   > Result after:
   > 
   > ```
   > a | a
   > _ | _
   > 1 | 1
   > 1 | 2
   > 2 | 2
   > 2 | 3
   > ```
   > 
   > ### How was this patch tested?
   > Unit tests
   
   Shouldn't the "Result after" be
   ```
    a | a
    _ | _
    1 | 1
    1 | 1
    2 | 2
    2 | 3
   ```
   ? The first branch selects `a` twice, so both of its rows should be `1 | 1`.
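
As a quick sanity check of the expected rows, here is a minimal sketch that runs an equivalent query on SQLite through Python's `sqlite3` module. SQLite is only a stand-in here, not Spark, and the query is rewritten with CTEs because SQLite does not accept the `AS t1(a, b)` alias form; `run_union_example` is a hypothetical helper name.

```python
import sqlite3

# Same shape as the example query: the first branch selects `a` twice.
# Assumption: SQLite stands in for Spark purely to illustrate the expected rows.
QUERY = """
WITH t1(a, b) AS (VALUES (1, 1), (1, 2)),
     t2(c, d) AS (VALUES (2, 2), (2, 3))
SELECT a, a FROM t1
UNION ALL
SELECT c, d FROM t2
"""


def run_union_example():
    """Run the query in an in-memory database and return the rows, sorted."""
    conn = sqlite3.connect(":memory:")
    try:
        rows = conn.execute(QUERY).fetchall()
    finally:
        conn.close()
    # UNION ALL gives no ordering guarantee, so sort for a stable comparison.
    return sorted(rows)


if __name__ == "__main__":
    for row in run_union_example():
        print(row)
```

If this reasoning carries over to Spark, the second row of the union output would be `1 | 1` rather than `1 | 2`.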
   Also, why should we support `select a from (SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d))`? The result after this fix is:
    a
    _ 
    1 
    1 
    2 
    2 
   I think this query should fail because `a` is ambiguous, instead of silently resolving to the first column.
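
To make the suggestion concrete, here is a minimal sketch of the stricter resolution rule I have in mind: a name that matches more than one output column is rejected instead of being bound to the first match. This is plain Python for illustration only, not Spark's analyzer; `resolve_column` and `AmbiguousColumnError` are hypothetical names.

```python
class AmbiguousColumnError(Exception):
    """Raised when a column name matches more than one output column."""


def resolve_column(name, output_columns):
    """Return the index of `name` in `output_columns`.

    Raises KeyError if the name is missing and AmbiguousColumnError if it
    appears more than once, rather than picking the first occurrence.
    """
    matches = [i for i, col in enumerate(output_columns) if col == name]
    if not matches:
        raise KeyError(f"column not found: {name}")
    if len(matches) > 1:
        raise AmbiguousColumnError(
            f"column {name!r} is ambiguous, matches positions {matches}"
        )
    return matches[0]


if __name__ == "__main__":
    print(resolve_column("b", ["a", "b"]))  # unique name resolves normally
    try:
        resolve_column("a", ["a", "a"])     # the union output from the example
    except AmbiguousColumnError as err:
        print(err)
```

Under this rule, `select a from (...)` over an output of `a | a` would be an analysis error, while queries that never reference the duplicated name would keep working.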


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


