chasingegg edited a comment on pull request #35290:
URL: https://github.com/apache/spark/pull/35290#issuecomment-1024823661
> Is it a breaking change? Seems it will break `val df = ...; df.select("a",
"a").select(df("a"));`
>
> I'd like to have a surgical fix for this bug. Ideally, when a plan has
conflicting output attr ids, the columns with the same attr id should output
the same value, so it doesn't matter which column you bind to at the end.
However, `Union` breaks this assumption. We don't want to fix `Union` in old
branches as it's a breaking change, and we should fix the attribute binding
logic for `Union` to minimize the impact.
>
> My proposal is:
>
> 1. For SQL API, the plan is `Distinct(Union(A, B))`. In the rule
`ReplaceDistinctWithAggregate`, we replace `Distinct` with a group-only
`Aggregate`, and take all the output attrs of `Union` as the grouping columns.
We can add a special metadata property to `AttributeReference`, to indicate the
actual column it wants to bind. e.g. `__index_of_attributes_with_same_id=1`
means it should bind to the second column that has this attribute id, and we
update `BindReference` accordingly to implement this special binding logic.
> 2. For dataframe API (`df.distinct`), the plan is `Deduplicate(attrs,
Union(A, B))`. We can use the same idea to add special metadata property in
`df.distinct`.
> 3. For `PushProjectionThroughUnion`, we skip the optimization if Union has
conflicting attr ids in its output.
@cloud-fan Sorry for the late reply. Thanks for your proposal! However I
don't think it is a good way to fix it like that. First of all, I think we
should clear that what the result of union all/union should be. I still use
some examples with duplicate columns of first union's child. There are two
cases.
We have a `conflicting example`.
sample query:
select
a,
a
from values (1, 1), (1, 2) as t1(a, b)
UNION ALL
SELECT
c,
d
from values (2, 3), (2, 3) as t2(c, d)
result is (1, 1), (1, 1), (3, 3), (3, 3)
expected (1, 1), (1, 1), (2, 3), (2, 3)
And we have a `non-conflicting` example like below
sample query:
select
a,
a
from values (1, 1), (1, 2) as t1(a, b)
UNION ALL
SELECT
c,
c
from values (2, 3), (2, 3) as t2(c, d)
result is (1, 1), (1, 1), (2, 2), (2, 2)
---
As you said, Union would break the assumption that the columns with the same
attr id should output the same value. So I think it should alias as two
different attr ids since their values could be different. Let's take
`conflicting example` as the subquery, and we do **select a from subquery**, it
is an expected behavior that the query would be broken since there are two
columns with the same name but their values are different.
But whether the `non-conflicting example` should also be broken, that is the
question. If we alias the output of union, it would complain about both two
cases. Or we could carefully alias only when it is the `conflicting example`.
P.S. I tested on PostgreSQL, it would complain about both two cases as it is
`ambiguous`, so I think it alias the outputs of Union.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]