Yian Liou created SPARK-33726:
---------------------------------

             Summary: Duplicate field names causes wrong answers during aggregation
                 Key: SPARK-33726
                 URL: https://issues.apache.org/jira/browse/SPARK-33726
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.1, 2.4.4
            Reporter: Yian Liou

We saw this bug at Workday.

When different fields share the same name, org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate can return a fixed-length batch when it should return a variable-length batch, which leads to wrong results.

This example produces wrong results in the Spark shell:

scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show

|*x*|*x*|*ma*|*mb*|
|-2|2|0|null|
|-1|1|null|1|
|0|0|0|0|

instead of the correct output:

|*x*|*x*|*ma*|*mb*|
|0|0|0|0|
|-2|2|2|2|
|-1|1|1|1|

The issue can be solved by iterating over the fields themselves instead of over the field names.
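
The difference between iterating over field names and iterating over fields can be sketched as follows. This is a minimal, dependency-free illustration, not Spark's code: Field, Schema, isFixedLength and the type-name strings are hypothetical stand-ins, and the sketch assumes that a lookup by name resolves to a single field per name (here, the last one declared).

```scala
// Hypothetical stand-ins for a schema; these are not Spark's classes.
final case class Field(name: String, dataType: String)

final case class Schema(fields: Seq[Field]) {
  // Name-based lookup: with duplicate names only one field per name survives
  // (toMap keeps the last entry for a repeated key).
  private val byName: Map[String, Field] = fields.map(f => f.name -> f).toMap
  def fieldNames: Seq[String] = fields.map(_.name)
  def apply(name: String): Field = byName(name)
}

object DuplicateFieldNames {
  // Assumed for illustration: only these types have a fixed width.
  def isFixedLength(dataType: String): Boolean =
    Set("long", "int", "double", "boolean").contains(dataType)

  // Problematic pattern: resolve every field through its name.
  def allFixedByName(schema: Schema): Boolean =
    schema.fieldNames.forall(n => isFixedLength(schema(n).dataType))

  // Suggested pattern: inspect each field directly.
  def allFixedByField(schema: Schema): Boolean =
    schema.fields.forall(f => isFixedLength(f.dataType))

  // Grouping key from the query above: U.x (string) and T.x (long), both named "x".
  val keySchema: Schema = Schema(Seq(Field("x", "string"), Field("x", "long")))

  def main(args: Array[String]): Unit = {
    println(allFixedByName(keySchema))  // true: the string field is never inspected
    println(allFixedByField(keySchema)) // false: the string field is seen
  }
}
```

In this sketch the name-based check looks up "x" twice and sees the long field both times, so it reports an all-fixed-length key even though one key column is a string. The field-based check visits both columns and reports the key as variable-length.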