[GitHub] [spark] chasingegg commented on pull request #35290: [SPARK-37865][SQL][3.0]Fix union bug when the first child of union has duplicate columns

GitBox Fri, 28 Jan 2022 19:38:42 -0800


chasingegg commented on pull request #35290:
URL: https://github.com/apache/spark/pull/35290#issuecomment-1024823661



   > Is it a breaking change? Seems it will break `val df = ...; df.select("a", 
"a").select(df("a"));`
   > 
   > I'd like to have a surgical fix for this bug. Ideally, when a plan has 
conflicting output attr ids, the columns with the same attr id should output 
the same value, so it doesn't matter which column you bind to at the end. 
However, `Union` breaks this assumption. We don't want to fix `Union` in old 
branches as it's a breaking change, and we should fix the attribute binding 
logic for `Union` to minimize the impact.
   > 
   > My proposal is:
   > 
   > 1. For SQL API, the plan is `Distinct(Union(A, B))`. In the rule 
`ReplaceDistinctWithAggregate`, we replace `Distinct` with a group-only 
`Aggregate`, and take all the output attrs of `Union` as the grouping columns. 
We can add a special metadata property to `AttributeReference`, to indicate the 
actual column it wants to bind. e.g. `__index_of_attributes_with_same_id=1` 
means it should bind to the second column that has this attribute id, and we 
update `BindReference` accordingly to implement this special binding logic.
   > 2. For dataframe API (`df.distinct`), the plan is `Deduplicate(attrs, 
Union(A, B))`. We can use the same idea to add special metadata property in 
`df.distinct`.
   > 3. For `PushProjectionThroughUnion`, we skip the optimization if Union has 
conflicting attr ids in its output.
   
   @cloud-fan  Sorry for the late reply. Thans for your proposal! However I 
don't think it is a good way to fix it like that. First of all, I think we 
should clear that what the result of union all/union should be. I still use 
some examples with duplicate columns of first union's child. There are two 
cases. 
   We have a `conflicting example`.
   sample query:
   select
   a,
   a
   from values (1, 1), (1, 2) as t1(a, b)
   UNION ALL
   SELECT
   c,
   d
   from values (2, 3), (2, 3) as t2(c, d)
   
   result is (1, 1), (1, 1), (3, 3), (3, 3)
   expected (1, 1), (1, 1), (2, 3), (2, 3)
   
   And we have a `non-conflicting` example like below
   sample query:
   select
   a,
   a
   from values (1, 1), (1, 2) as t1(a, b)
   UNION ALL
   SELECT
   c,
   c
   from values (2, 3), (2, 3) as t2(c, d)
   
   result is (1, 1), (1, 1), (2, 2), (2, 2)
   
   ---
   As you said, Union would break the assumption that the columns with the same 
attr id should output the same value. So I think it should alias as two 
different attr ids since their values could be different. Let's take 
`conflicting example` as the subquery, and we do **select a from subquery**, it 
is an expected behavior that the query would be broken since there are two 
columns with the same name but their values are different. 
   But whether the `non-conflicting example` should also be broken, that is the 
question. If we alias the output of union, it would complain about both two 
cases. Or we could carefully alias only when it is the `conflicting example`.
   
   P.S. I tested on PostgreSQL, it would complain about both two cases as it is 
`ambiguous`, so I think it alias the outputs of Union. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] chasingegg commented on pull request #35290: [SPARK-37865][SQL][3.0]Fix union bug when the first child of union has duplicate columns

Reply via email to