cloud-fan commented on code in PR #37334: URL: https://github.com/apache/spark/pull/37334#discussion_r938976805
########## sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt: ##########

```diff
@@ -157,31 +157,31 @@
 Input [2]: [s_state#14, sum#16]
 Keys [1]: [s_state#14]
 Functions [1]: [sum(UnscaledValue(ss_net_profit#10))]
 Aggregate Attributes [1]: [sum(UnscaledValue(ss_net_profit#10))#17]
-Results [3]: [s_state#14, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#18]
+Results [3]: [s_state#14 AS s_state#18, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#19]
```

Review Comment:
I'm surprised that this query plan works fine today (two `s_state#14`). Reading the source code a bit more, I think Spark is actually fine with duplicated attribute ids: it assumes that columns with the same attribute id always output the same values, so it can safely bind an attribute to the first match. See `BindReferences`.

So the problem is still in `Union`. It's not fine to introduce duplicated attribute ids in the first child of `Union`, because they end up duplicated in the output of `Union`, which causes wrong results. I think eventually we should make `Union` use fresh attribute ids to build its output columns.

As a bug fix, your first attempt, simply not removing aliases in the first child of `Union`, looks like the best option: it's simple, and keeping a few unnecessary aliases won't impact performance. Sorry for the back and forth!
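The first-match binding behavior described above can be sketched with a toy model in plain Scala. This is a simplified illustration, not Spark's actual implementation: the names `Attr` and `bindReference` here are hypothetical stand-ins for Spark's `AttributeReference` and `BindReferences.bindReference`.

```scala
// Toy model of binding attributes by id, as described in the comment.
// An attribute identified by a numeric expression id, like s_state#14.
case class Attr(name: String, id: Long)

// Resolve an attribute to an ordinal in the child's output by taking the
// FIRST slot whose id matches. This is only safe under Spark's assumption
// that columns sharing an attribute id always carry identical values.
def bindReference(attr: Attr, childOutput: Seq[Attr]): Int = {
  val ordinal = childOutput.indexWhere(_.id == attr.id)
  require(ordinal >= 0, s"Couldn't find ${attr.name}#${attr.id}")
  ordinal
}

// Two slots sharing id 14, mirroring the two s_state#14 in the plan above:
val output = Seq(Attr("s_state", 14), Attr("s_state", 14), Attr("_w2", 19))
val bound = bindReference(Attr("s_state", 14), output)  // first match wins
```

Within a single operator the duplication is harmless, since both slots really do hold the same value; the hazard arises once `Union` inherits the duplicated ids into its output, where the two positions come from different columns of the other children.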
