cloud-fan commented on pull request #35290:
URL: https://github.com/apache/spark/pull/35290#issuecomment-1019849857
Is it a breaking change? Seems it will break `val df = ...; df.select("a",
"a").select(df("a"));`
I'd like to have a surgical fix for this bug. Ideally, when a plan has
conflicting output attr ids, the columns with the same attr id should output
the same value, so it doesn't matter which column you bind to at the end.
However, `Union` breaks this assumption. We don't want to fix `Union` in old
branches as it's a breaking change, and we should fix the attribute binding
logic for `Union` to minimize the impact.
My proposal is:
1. For SQL API, the plan is `Distinct(Union(A, B))`. In the rule
`ReplaceDistinctWithAggregate`, we replace `Distinct` with a group-only
`Aggregate`, and take all the output attrs of `Union` as the grouping columns.
We can add a special metadata property to `AttributeReference`, to indicate the
actual column it wants to bind. e.g. `__index_of_attributes_with_same_id=1`
means it should bind to the second column that has this attribute id, and we
update `BindReference` accordingly to implement this special binding logic.
2. For dataframe API (`df.distinct`), the plan is `Deduplicate(attrs,
Union(A, B))`. We can use the same idea to add special metadata property in
`df.distinct`.
3. For `PushProjectionThroughUnion`, we skip the optimization if Union has
conflicting attr ids in its output.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]