karenfeng opened a new pull request #35760:
URL: https://github.com/apache/spark/pull/35760


   ### What changes were proposed in this pull request?
   
   Fixes a correctness bug in `Union` in the case that there are duplicate 
output columns. Previously, duplicate columns on one side of the union would 
result in a duplicate column being output on the other side of the union.
   
   To do so, we go through the union’s child’s output and find the duplicates. 
For each duplicate set, there is a first duplicate: this one is left alone. All 
following duplicates are aliased and given a tag; this tag is used to remove 
ambiguity during resolution.
   
   As the first duplicate is left alone, the user can still select it, avoiding 
a breaking change. As the later duplicates are given new expression IDs, this 
fixes the correctness bug.
   
   ### Why are the changes needed?
   
   Output of union with duplicate columns in the children was incorrect
   
   ### Does this PR introduce _any_ user-facing change?
   
   Example query:
   ```
   SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d)
   ```
   
   Result before:
   ```
   a | a
   _ | _
   1 | 1
   1 | 1
   2 | 2
   2 | 2
   ```
   
   Result after:
   ```
   a | a
   _ | _
   1 | 1
   1 | 2
   2 | 2
   2 | 3
   ```
   
   ### How was this patch tested?
   
   Unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to