cloud-fan commented on code in PR #37334:
URL: https://github.com/apache/spark/pull/37334#discussion_r938976805


##########
sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt:
##########
@@ -157,31 +157,31 @@ Input [2]: [s_state#14, sum#16]
 Keys [1]: [s_state#14]
 Functions [1]: [sum(UnscaledValue(ss_net_profit#10))]
 Aggregate Attributes [1]: [sum(UnscaledValue(ss_net_profit#10))#17]
-Results [3]: [s_state#14, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#18]
+Results [3]: [s_state#14 AS s_state#18, s_state#14, MakeDecimal(sum(UnscaledValue(ss_net_profit#10))#17,17,2) AS _w2#19]

Review Comment:
   I'm surprised that this query plan works fine today (two `s_state#14`). Reading the source code a bit more, I think Spark is actually fine with duplicated attribute ids: it assumes columns with the same attr id always output the same values, so it can safely bind an attribute to the first match. See `BindReferences`.
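   To make the "first match" assumption concrete, here is a toy sketch (plain Python, not Spark's actual `BindReferences` code, and `bind_reference` is a hypothetical helper name) of binding an attribute id against a child's output schema:

   ```python
   # Toy model of first-match attribute binding. Assumption: columns with
   # the same attribute id always carry the same values, so binding to the
   # first matching column is safe.

   def bind_reference(attr_id, child_output_ids):
       """Return the ordinal of the first column whose id matches attr_id."""
       for ordinal, col_id in enumerate(child_output_ids):
           if col_id == attr_id:
               return ordinal
       raise ValueError(f"attribute id {attr_id} not found")

   # Child outputs [s_state#14, s_state#14, _w2#18]: the duplicate id 14
   # is harmless here because both columns hold the same values.
   print(bind_reference(14, [14, 14, 18]))  # -> 0
   ```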
   
   So the problem is still in `Union`. It's not fine to introduce duplicated attr ids in the first child of `Union`, as that introduces duplicated attr ids in the output of `Union`, which causes wrong results.
   
   I think eventually we should make `Union` use fresh attr ids to build its output columns. As a bug fix, your first attempt, simply not removing aliases in the first child of `Union`, looks like the best option: it's simple, and a few extra unnecessary aliases won't impact performance. Sorry for the back and forth!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
