cloud-fan commented on pull request #35290:
URL: https://github.com/apache/spark/pull/35290#issuecomment-1019849857


   Is it a breaking change? Seems it will break `val df = ...; df.select("a", 
"a").select(df("a"));`
   
   I'd like to have a surgical fix for this bug. Ideally, when a plan has 
conflicting output attr ids, the columns with the same attr id should output 
the same value, so it doesn't matter which column you bind to at the end. 
However, `Union` breaks this assumption. We don't want to fix `Union` in old 
branches as it's a breaking change, and we should fix the attribute binding 
logic for `Union` to minimize the impact.
   
   My proposal is:
   1. For SQL API, the plan is `Distinct(Union(A, B))`. In the rule 
`ReplaceDistinctWithAggregate`, we replace `Distinct` with a group-only 
`Aggregate`, and take all the output attrs of `Union` as the grouping columns. 
We can add a special metadata property to `AttributeReference`, to indicate the 
actual column it wants to bind. e.g. `__index_of_attributes_with_same_id=1` 
means it should bind to the second column that has this attribute id, and we 
update `BindReference` accordingly to implement this special binding logic.
   2. For dataframe API (`df.distinct`), the plan is `Deduplicate(attrs, 
Union(A, B))`. We can use the same idea to add special metadata property in 
`df.distinct`.
   3. For `PushProjectionThroughUnion`, we skip the optimization if Union has 
conflicting attr ids in its output.
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to