chasingegg opened a new pull request #35290:
URL: https://github.com/apache/spark/pull/35290


   This is the backport PR related to #35168.
   
   ### What changes were proposed in this pull request?
   
   When the first child of a Union has duplicate columns, as in select a, a from t1 
union select a, b from t2, Spark uses only the first column to aggregate the 
results, which makes the results incorrect. This behavior is also 
inconsistent with other engines such as PostgreSQL and MySQL. **We could alias the 
attributes of the first child of the Union to resolve this, or one could argue that 
this is a feature of Spark SQL**.
   
   sample query:
   select
   a,
   a
   from values (1, 1), (1, 2) as t1(a, b)
   UNION ALL
   SELECT
   c,
   d
   from values (2, 3), (2, 3) as t2(c, d)
   
   result is (1, 1), (1, 1), (3, 3), (3, 3)
   expected (1, 1), (1, 1), (2, 3), (2, 3)
   
   ---
   
   select
   a,
   a
   from values (1, 1), (1, 2) as t1(a, b)
   UNION
   SELECT
   c,
   d
   from values (2, 3), (2, 3) as t2(c, d)
   
   result is (1, 1), (2, 2)
   expected (1, 1), (2, 3)
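
The UNION example above can be reproduced with a small toy model. This is only a sketch in Python, not Spark's actual code: it assumes that output columns are resolved by a per-attribute id, so two references to the same column `a` both resolve to the first position. The names `Attr`, `alias`, `bind`, and `distinct_union` are hypothetical and exist only for this illustration.

```python
from itertools import count

_next_id = count()  # source of unique expression ids (toy stand-in)


class Attr:
    """Hypothetical model of a column reference: a name plus a unique id."""

    def __init__(self, name):
        self.name = name
        self.expr_id = next(_next_id)


def alias(attr):
    """Return a new attribute with the same name but a fresh id."""
    return Attr(attr.name)


def bind(output, attr):
    """Resolve an attribute to the ordinal of the first output column sharing its id."""
    for ordinal, candidate in enumerate(output):
        if candidate.expr_id == attr.expr_id:
            return ordinal
    raise KeyError(attr.name)


def distinct_union(first_output, first_rows, second_rows):
    """Union rows by position, then deduplicate by grouping on the first child's output."""
    ordinals = [bind(first_output, attr) for attr in first_output]
    result = []
    for row in first_rows + second_rows:
        grouped = tuple(row[i] for i in ordinals)
        if grouped not in result:
            result.append(grouped)
    return result


a = Attr("a")
t1_rows = [(1, 1), (1, 1)]  # select a, a from values (1, 1), (1, 2) as t1(a, b)
t2_rows = [(2, 3), (2, 3)]  # select c, d from values (2, 3), (2, 3) as t2(c, d)

# Both output columns share one id, so both resolve to ordinal 0.
print(distinct_union([a, a], t1_rows, t2_rows))  # [(1, 1), (2, 2)]

# Aliasing gives each column its own id, so positions are preserved.
print(distinct_union([alias(a), alias(a)], t1_rows, t2_rows))  # [(1, 1), (2, 3)]
```

Under these assumptions, the first call matches the reported result and the second, with aliased attributes, matches the expected one. The model covers only the UNION case and does not attempt to reproduce the UNION ALL result.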
   
   
   ### Why are the changes needed?
   
   The current behavior produces incorrect results for queries like the ones above, so it is possibly a bug.
   
   
   ### Does this PR introduce _any_ user-facing change?
    
   Yes. When the first child of a union has duplicate columns, the result 
will be different.
   
   
   ### How was this patch tested?
   
   Added a new unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


