Michail Giannakopoulos created SPARK-31059:
----------------------------------------------
Summary: Spark's SQL "group by" local processing operator is broken.
Key: SPARK-31059
URL: https://issues.apache.org/jira/browse/SPARK-31059
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.5, 2.4.3
Environment: Windows 10.
Reporter: Michail Giannakopoulos
When applying the "GROUP BY" operator (without an "ORDER BY" clause), I
expect rows with the same grouping-column values to be placed together in the
same buckets in the output. However, this is not the case.
Steps to reproduce:
1. Start spark-shell as follows:
bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
2. Load the attached csv file:
val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
3. Create a temp view:
gosales.createOrReplaceTempView("gosales")
4. Execute the following sql statement:
spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()
Output:
+--------------------+-----------------+----------------------------+
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
| Golf Equipment| E-mail| 92.25|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Fax| null|
| Golf Equipment| Telephone| 123.0|
| Camping Equipment| Special| null|
| Outdoor Protection| Telephone| 34218.19|
|Mountaineering Eq...| Mail| 0.0|
| Camping Equipment| Web| 32469.03|
|Personal Accessories| Fax| 3318.7|
| Golf Equipment| Sales visit| 143.5|
|Mountaineering Eq...| Telephone| null|
|Mountaineering Eq...| E-mail| null|
| Outdoor Protection| Sales visit| 20522.42|
| Outdoor Protection| Fax| 5857.54|
|Personal Accessories| E-mail| 26679.640000000003|
|Mountaineering Eq...| Fax| null|
| Outdoor Protection| Web| 340836.85000000003|
| Golf Equipment| Special| 0.0|
| Outdoor Protection| E-mail| 28505.93|
| Golf Equipment| Web| 3034.0|
+--------------------+-----------------+----------------------------+
Expected output:
+--------------------+-----------------+----------------------------+
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
| Golf Equipment| E-mail| 92.25|
| Golf Equipment| Fax| null|
| Golf Equipment| Mail| 0.0|
| Golf Equipment| Sales visit| 143.5|
| Golf Equipment| Special| 0.0|
| Golf Equipment| Telephone| 123.0|
| Golf Equipment| Web| 3034.0|
| Camping Equipment| E-mail| 1303.3999999999999|
| Camping Equipment| Fax| null|
| Camping Equipment| Sales visit| 4754.87|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Special| null|
| Camping Equipment| Telephone| 5169.65|
| Camping Equipment| Web| 32469.03|
|Mountaineering Eq...| E-mail| null|
|Mountaineering Eq...| Fax| null|
|Mountaineering Eq...| Mail| 0.0|
|Mountaineering Eq...| Special| null|
|Mountaineering Eq...| Sales visit| null|
|Mountaineering Eq...| Telephone| null|
+--------------------+-----------------+----------------------------+
Notice that rows sharing the same grouping-column values should be bucketed
together, even if the buckets themselves do not appear in any particular order.
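For comparison, here is a minimal sketch of how a similar layout can be obtained with an explicit "ORDER BY" on the grouping columns. It is meant to be pasted into spark-shell, where `spark` is predefined. The rows, the view name `gosales_sample` and the alias `revenue` are hypothetical stand-ins, since the attached csv file is not reproduced here. With "ORDER BY" the buckets come out sorted, which is stricter than the adjacency described above.

```scala
// Runs inside spark-shell, where `spark` is the predefined SparkSession.
import spark.implicits._

// Hypothetical stand-in rows; not taken from SampleFile_GOSales.csv.
val sample = Seq(
  ("Golf Equipment", "Web", 10.0),
  ("Camping Equipment", "Mail", 5.0),
  ("Golf Equipment", "E-mail", 2.5),
  ("Camping Equipment", "Web", 7.0),
  ("Golf Equipment", "Web", 1.5)
).toDF("Product line", "Order method type", "Revenue")

sample.createOrReplaceTempView("gosales_sample")

// Same query shape as in step 4, plus an explicit ORDER BY on the grouping columns.
val ordered = spark.sql(
  """SELECT `Product line`, `Order method type`, sum(`Revenue`) AS revenue
    |FROM gosales_sample
    |GROUP BY `Product line`, `Order method type`
    |ORDER BY `Product line`, `Order method type`""".stripMargin)

ordered.show()
```

In this sketch, rows with the same `Product line` end up next to each other because of the sort, not because of the grouping itself.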