[ 
https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31059:
-------------------------------------------
    Description: 
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +-----------------------+--------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+-----------------------+--------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.640000000003|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.85000000003|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+-----------------------+--------------++----------------------------

Expected output:
 +-----------------------+--------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+-----------------------+--------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Golf Equipment|Fax|null|
|Golf Equipment|Mail|0.0|
|Golf Equipment|Sales visit|143.5|
|Golf Equipment|Special|0.0|
|Golf Equipment|Telephone|123.0|
|Golf Equipment|Web|3034.0|
|Camping Equipment|E-mail|1303.3999999999999|
|Camping Equipment|Fax|null|
|Camping Equipment|Sales visit|4754.87|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Special|null|
|Camping Equipment|Telephone|5169.65|
|Camping Equipment|Web|32469.03|
|Mountaineering Eq...|E-mail|null|
|Mountaineering Eq...|Fax|null|
|Mountaineering Eq...|Mail|0.0|
|Mountaineering Eq...|Special|null|
|Mountaineering Eq...|Sales visit|null|
|Mountaineering Eq...|Telephone|null|

+-----------------------+--------------++----------------------------

Notice how in the expected output all the grouping columns should be bucketed 
together without necessarily being in order, which is not the case with the 
output that spark produces.

  was:
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I 
expect to see all the grouping columns being grouped together to the same 
buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf 
spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", 
"true").option("inferSchema", 
"true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
`gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +----------------------+---------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+----------------------+---------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.640000000003|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.85000000003|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+----------------------+---------------++----------------------------

Expected output:
 +----------------------+---------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+----------------------+---------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Golf Equipment|Fax|null|
|Golf Equipment|Mail|0.0|
|Golf Equipment|Sales visit|143.5|
|Golf Equipment|Special|0.0|
|Golf Equipment|Telephone|123.0|
|Golf Equipment|Web|3034.0|
|Camping Equipment|E-mail|1303.3999999999999|
|Camping Equipment|Fax|null|
|Camping Equipment|Sales visit|4754.87|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Special|null|
|Camping Equipment|Telephone|5169.65|
|Camping Equipment|Web|32469.03|
|Mountaineering Eq...|E-mail|null|
|Mountaineering Eq...|Fax|null|
|Mountaineering Eq...|Mail|0.0|
|Mountaineering Eq...|Special|null|
|Mountaineering Eq...|Sales visit|null|
|Mountaineering Eq...|Telephone|null|

+----------------------+---------------++----------------------------

Notice how in the expected output all the grouping columns should be bucketed 
together without necessarily being in order, which is not the case with output 
that spark produces.


> Spark's SQL "group by" local processing operator is broken.
> -----------------------------------------------------------
>
>                 Key: SPARK-31059
>                 URL: https://issues.apache.org/jira/browse/SPARK-31059
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3, 2.4.5
>         Environment: Windows 10.
>            Reporter: Michail Giannakopoulos
>            Priority: Blocker
>         Attachments: SampleFile_GOSales.csv
>
>
> When applying "GROUP BY" processing operator (without an "ORDER BY" clause), 
> I expect to see all the grouping columns being grouped together to the same 
> buckets. However, this is not the case.
> Steps to reproduce:
>  1. Start spark-shell as follows:
>  bin\spark-shell.cmd --master local[4] --conf 
> spark.sql.catalogImplementation=in-memory
>  2. Load the attached csv file:
>  val gosales = spark.read.format("csv").option("header", 
> "true").option("inferSchema", 
> "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
>  3. Create a temp view:
>  gosales.createOrReplaceTempView("gosales")
>  4. Execute the following sql statement:
>  spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM 
> `gosales` GROUP BY `Product line`, `Order method type`").show()
> Output: 
>  +-----------------------+--------------++----------------------------
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +-----------------------+--------------++----------------------------
> |Golf Equipment|E-mail|92.25|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Fax|null|
> |Golf Equipment|Telephone|123.0|
> |Camping Equipment|Special|null|
> |Outdoor Protection|Telephone|34218.19|
> |Mountaineering Eq...|Mail|0.0|
> |Camping Equipment|Web|32469.03|
> |Personal Accessories|Fax|3318.7|
> |Golf Equipment|Sales visit|143.5|
> |Mountaineering Eq...|Telephone|null|
> |Mountaineering Eq...|E-mail|null|
> |Outdoor Protection|Sales visit|20522.42|
> |Outdoor Protection|Fax|5857.54|
> |Personal Accessories|E-mail|26679.640000000003|
> |Mountaineering Eq...|Fax|null|
> |Outdoor Protection|Web|340836.85000000003|
> |Golf Equipment|Special|0.0|
> |Outdoor Protection|E-mail|28505.93|
> |Golf Equipment|Web|3034.0|
> +-----------------------+--------------++----------------------------
> Expected output:
>  +-----------------------+--------------++----------------------------
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +-----------------------+--------------++----------------------------
> |Golf Equipment|E-mail|92.25|
> |Golf Equipment|Fax|null|
> |Golf Equipment|Mail|0.0|
> |Golf Equipment|Sales visit|143.5|
> |Golf Equipment|Special|0.0|
> |Golf Equipment|Telephone|123.0|
> |Golf Equipment|Web|3034.0|
> |Camping Equipment|E-mail|1303.3999999999999|
> |Camping Equipment|Fax|null|
> |Camping Equipment|Sales visit|4754.87|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Special|null|
> |Camping Equipment|Telephone|5169.65|
> |Camping Equipment|Web|32469.03|
> |Mountaineering Eq...|E-mail|null|
> |Mountaineering Eq...|Fax|null|
> |Mountaineering Eq...|Mail|0.0|
> |Mountaineering Eq...|Special|null|
> |Mountaineering Eq...|Sales visit|null|
> |Mountaineering Eq...|Telephone|null|
> +-----------------------+--------------++----------------------------
> Notice how in the expected output all the grouping columns should be bucketed 
> together without necessarily being in order, which is not the case with the 
> output that spark produces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to