chasingegg opened a new pull request #35290:
URL: https://github.com/apache/spark/pull/35290
This is the backport PR related to #35168.
### What changes were proposed in this pull request?
When the first child of a Union has duplicate columns, as in select a, a from t1
union select a, b from t2, Spark uses only the first column to aggregate the
results, which makes the results incorrect. This behavior is inconsistent with
other engines such as PostgreSQL and MySQL. **We could alias the attributes of
the first child of the union to resolve this, or you could argue that this is a
feature of Spark SQL**.
Sample queries:
select
a,
a
from values (1, 1), (1, 2) as t1(a, b)
UNION ALL
SELECT
c,
d
from values (2, 3), (2, 3) as t2(c, d)
result is (1, 1), (1, 1), (3, 3), (3, 3)
expected (1, 1), (1, 1), (2, 3), (2, 3)
---
select
a,
a
from values (1, 1), (1, 2) as t1(a, b)
UNION
SELECT
c,
d
from values (2, 3), (2, 3) as t2(c, d)
result is (1, 1), (2, 2)
expected (1, 1), (2, 3)
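As a quick cross-check of the "expected" rows above, the same two queries can be run on another SQL engine. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; SQLite is not mentioned in this PR, and the CTE form and the helper name `run_union` are assumptions made for illustration (SQLite does not accept the `values ... as t1(a, b)` alias syntax).

```python
import sqlite3

# Same data as the sample queries, written as CTEs because SQLite does not
# support the `values ... as t1(a, b)` alias form used in Spark SQL.
CTES = """
WITH t1(a, b) AS (VALUES (1, 1), (1, 2)),
     t2(c, d) AS (VALUES (2, 3), (2, 3))
"""


def run_union(operator):
    """Hypothetical helper: run `SELECT a, a FROM t1 <operator> SELECT c, d FROM t2`."""
    query = f"{CTES} SELECT a, a FROM t1 {operator} SELECT c, d FROM t2"
    conn = sqlite3.connect(":memory:")
    try:
        # Sort the rows so the comparison does not depend on engine row order.
        return sorted(conn.execute(query).fetchall())
    finally:
        conn.close()


union_all_rows = run_union("UNION ALL")
union_rows = run_union("UNION")
print(union_all_rows)
print(union_rows)
```

Under these assumptions, both results should match the "expected" rows listed for the sample queries rather than the rows Spark currently returns.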
### Why are the changes needed?
The current behavior returns incorrect results (see the sample queries above) and is possibly a bug.
### Does this PR introduce _any_ user-facing change?
Yes. When the first child of a union has duplicate columns, the result
will be different from before.
### How was this patch tested?
Added a new unit test.