[jira] [Created] (SPARK-33726) Duplicate field names causes wrong answers during aggregation

Yian Liou (Jira) Wed, 09 Dec 2020 12:50:36 -0800

Yian Liou created SPARK-33726:
---------------------------------

             Summary: Duplicate field names causes wrong answers during aggregation
                 Key: SPARK-33726
                 URL: https://issues.apache.org/jira/browse/SPARK-33726
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.1, 2.4.4
            Reporter: Yian Liou



We saw this bug at Workday.

When two different fields in the grouping key share the same name, 
org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate can 
return a fixed-length batch when it should have returned a variable-length 
batch, which leads to wrong results.

This example produces wrong results in the spark shell:

scala> sql("""with T as (select id as a, -id as x from range(3)),
  U as (select id as b, cast(id as string) as x from range(3))
  select T.x, U.x, min(a) as ma, min(b) as mb
  from T join U on a = b group by U.x, T.x""").show
 
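+---+---+----+----+
|  x|  x|  ma|  mb|
+---+---+----+----+
| -2|  2|   0|null|
| -1|  1|null|   1|
|  0|  0|   0|   0|
+---+---+----+----+

instead of the correct output:

+---+---+----+----+
|  x|  x|  ma|  mb|
+---+---+----+----+
|  0|  0|   0|   0|
| -2|  2|   2|   2|
| -1|  1|   1|   1|
+---+---+----+----+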

The issue can be solved by iterating over the fields themselves instead of 
field names. 
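
To illustrate why iterating over names goes wrong, here is a minimal, 
self-contained sketch. It is not Spark's code: the Field class, both method 
names, and the map-based name lookup are hypothetical stand-ins, and the sketch 
assumes that a lookup by name resolves a duplicate name to a single field (here, 
the last one inserted). Under that assumption the string field x is never 
inspected, so the name-based check wrongly reports that every key field is 
fixed-length.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateFieldNames {
    // Hypothetical stand-in for a schema field; not Spark's StructField.
    public static final class Field {
        final String name;
        final boolean fixedLength;

        public Field(String name, boolean fixedLength) {
            this.name = name;
            this.fixedLength = fixedLength;
        }
    }

    // Name-based check: each name is resolved back to a field through a map.
    // Assumption: with duplicate names the map keeps only the last field.
    public static boolean allFixedByName(List<Field> schema) {
        Map<String, Field> byName = new HashMap<>();
        for (Field f : schema) {
            byName.put(f.name, f); // a later duplicate overwrites the earlier one
        }
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && byName.get(f.name).fixedLength;
        }
        return allFixed;
    }

    // Field-based check: every field is inspected directly, duplicates included.
    public static boolean allFixedByField(List<Field> schema) {
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && f.fixedLength;
        }
        return allFixed;
    }

    public static void main(String[] args) {
        // Mirrors "group by U.x, T.x": a string x (variable-length)
        // followed by a long x (fixed-length).
        List<Field> keys = Arrays.asList(new Field("x", false), new Field("x", true));
        System.out.println("by name:  " + allFixedByName(keys));
        System.out.println("by field: " + allFixedByField(keys));
    }
}
```

In this sketch the name-based check returns true for the two-field key while 
the field-based check returns false, which corresponds to choosing a 
fixed-length batch where a variable-length one is needed.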



