Sofia created SPARK-27282:
-----------------------------
Summary: Spark incorrect results when using UNION with GROUP BY clause
Key: SPARK-27282
URL: https://issues.apache.org/jira/browse/SPARK-27282
Project: Spark
Issue Type: Bug
Components: Spark Shell, Spark Submit, SQL
Affects Versions: 2.3.2
Environment: I'm using :
IntelliJ IDEA ==> 2018.1.4
spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
scala ==> 2.11.8
Reporter: Sofia
When a UNION is applied after a GROUP BY in Spark, the results obtained
are wrong.
The following example illustrates the issue:
{code:sql}
CREATE TABLE test_un (
  col1 varchar(255),
  col2 varchar(255),
  col3 varchar(255),
  col4 varchar(255)
);
INSERT INTO test_un (col1, col2, col3, col4)
VALUES (1,1,2,4),
       (1,1,2,4),
       (1,1,3,5),
       (2,2,2,null);
{code}
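I then ran the following code:
{code:scala}
import org.apache.spark.sql.functions.{col, count, lit}

// Toolkit.HiveToolkit.getDataFromHive is my own helper (not shown here);
// it returns the Hive table as a DataFrame.
val x = Toolkit.HiveToolkit.getDataFromHive("test", "test_un")

// Count per (col1, col2, col3), ignoring rows where col4 is null.
val y = x
  .filter(col("col4").isNotNull)
  .groupBy("col1", "col2", "col3")
  .agg(count(col("col3")).alias("cnt"))
  .withColumn("col_name", lit("col3"))
  .select(col("col1"), col("col2"),
    col("col_name"), col("col3").alias("col_value"), col("cnt"))

// Same aggregation, grouped on col4 instead of col3.
val z = x
  .filter(col("col4").isNotNull)
  .groupBy("col1", "col2", "col4")
  .agg(count(col("col4")).alias("cnt"))
  .withColumn("col_name", lit("col4"))
  .select(col("col1"), col("col2"),
    col("col_name"), col("col4").alias("col_value"), col("cnt"))

y.union(z).show()
{code}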
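To help check whether the behaviour depends on the Hive source, here is a minimal sketch of the same pipeline on an in-memory DataFrame. The object name UnionGroupByRepro, the helper names input and countsFor, and the local[*] master are assumptions made for illustration and are not part of the original code. It has not been confirmed that this variant shows the same wrong results.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical stand-alone variant of the code above; it replaces the
// Hive read with an in-memory DataFrame holding the same four rows.
object UnionGroupByRepro {
  // Same rows as test_un; None stands for the NULL in col4.
  def input(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, String, String, Option[String])](
      ("1", "1", "2", Some("4")),
      ("1", "1", "2", Some("4")),
      ("1", "1", "3", Some("5")),
      ("2", "2", "2", None)
    ).toDF("col1", "col2", "col3", "col4")
  }

  // Count rows per (col1, col2, <name>) and reshape to a common schema.
  def countsFor(x: DataFrame, name: String): DataFrame =
    x.filter(col("col4").isNotNull)
      .groupBy("col1", "col2", name)
      .agg(count(col(name)).alias("cnt"))
      .withColumn("col_name", lit(name))
      .select(col("col1"), col("col2"), col("col_name"),
        col(name).alias("col_value"), col("cnt"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-27282-repro")
      .getOrCreate()
    val x = input(spark)
    countsFor(x, "col3").union(countsFor(x, "col4")).show()
    spark.stop()
  }
}
{code}

If this variant prints correct values while the Hive-backed one does not, that would point at the table read rather than at the union itself.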
I obtained the following results, in which the col3 rows show the values of col4:
||col1||col2||col_name||col_value||cnt||
|1|1|col3|5|1|
|1|1|col3|4|2|
|1|1|col4|5|1|
|1|1|col4|4|2|
Expected results:
||col1||col2||col_name||col_value||cnt||
|1|1|col3|3|1|
|1|1|col3|2|2|
|1|1|col4|4|2|
|1|1|col4|5|1|
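The expected rows can be derived by hand from the four inserted rows. The following is a small sketch in plain Scala collections (no Spark involved); the names ExpectedCounts, Rec and countsFor are hypothetical and only serve this illustration.

{code:scala}
// Plain-Scala derivation of the expected rows; no Spark involved.
object ExpectedCounts {
  // Hypothetical in-memory copy of test_un; None stands for SQL NULL.
  final case class Rec(col1: String, col2: String, col3: String, col4: Option[String])

  val rows: Seq[Rec] = Seq(
    Rec("1", "1", "2", Some("4")),
    Rec("1", "1", "2", Some("4")),
    Rec("1", "1", "3", Some("5")),
    Rec("2", "2", "2", None)
  )

  // (col1, col2, col_name, col_value, cnt)
  type Out = (String, String, String, String, Long)

  // Drop rows with a null col4, group on (col1, col2, value), count each group.
  def countsFor(name: String, value: Rec => String): Seq[Out] =
    rows
      .filter(_.col4.isDefined)
      .groupBy(r => (r.col1, r.col2, value(r)))
      .map { case ((c1, c2, v), grp) => (c1, c2, name, v, grp.size.toLong) }
      .toSeq
      .sortBy(_._4)

  // Union of the two aggregations, as in y.union(z).
  val expected: Seq[Out] =
    countsFor("col3", _.col3) ++ countsFor("col4", _.col4.getOrElse(""))

  def main(args: Array[String]): Unit = expected.foreach(println)
}
{code}

The row (2,2,2,null) is removed by the filter on col4, so it should not contribute to either aggregation.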
However, when I remove the last row of the table (shown below), I obtain the correct results.
{code:java}
(2,2,2,null){code}