[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

Will Du (Jira) Sun, 11 Oct 2020 19:58:10 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-33116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Will Du updated SPARK-33116:
----------------------------
    Description: 
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

2 Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
result is ok. Why other results are like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
| drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
| drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
| game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
| game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
| toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
| toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
7 rows selected (0.436 seconds)

Does it seem that we can only order by the columns after partition by clause?
I do not think there are such limitation in standard SQL.


  was:
Prepare the data

CREATE TABLE IF NOT EXISTS product_catalog (
name STRING,category STRING,location STRING,price DECIMAL(10,2));

INSERT OVERWRITE product_catalog VALUES 
('Nest Coffee', 'drink', 'Toronto', 15.5),
('Pepesi', 'drink', 'Toronto', 9.99),
('Hasimal', 'toy', 'Toronto', 5.9),
('Fire War', 'game', 'Toronto', 70.0),
('Final Fantasy', 'game', 'Montreal', 79.99),
('Lego Friends 15005', 'toy', 'Montreal', 12.99),
('Nesion Milk', 'drink', 'Montreal', 8.9);

1. Query without ORDER BY after PARTITION BY col,  the result is correct.

SELECT
category, price,
max(price) over(PARTITION BY category) as max_p,
min(price) over(PARTITION BY category) as min_p,
sum(price) over(PARTITION BY category) as sum_p,
avg(price) over(PARTITION BY category) as avg_p,
count(*) over(PARTITION BY category) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
| game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
| game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
| toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
| toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
7 rows selected (0.442 seconds)

1. Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
result is ok. Why other results are like that?

SELECT
category, price,
max(price) over(PARTITION BY category ORDER BY price) as max_p,
min(price) over(PARTITION BY category ORDER BY price) as min_p,
sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
count(*)   over(PARTITION BY category ORDER BY price) as count_w
FROM
product_catalog;

|| category    || price      || max_p  || min_p    || sum_p    || avg_p         
  || count_w   ||
| drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
| drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
| drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
| game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
| game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
| toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
| toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
7 rows selected (0.436 seconds)

Does it seem that we can only order by the columns after partition by clause?
I do not think there are such limitation in standard SQL.



> Spark SQL window function with order by cause result incorrect
> --------------------------------------------------------------
>
>                 Key: SPARK-33116
>                 URL: https://issues.apache.org/jira/browse/SPARK-33116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1
>            Reporter: Will Du
>            Priority: Major
>
> Prepare the data
> CREATE TABLE IF NOT EXISTS product_catalog (
> name STRING,category STRING,location STRING,price DECIMAL(10,2));
> INSERT OVERWRITE product_catalog VALUES 
> ('Nest Coffee', 'drink', 'Toronto', 15.5),
> ('Pepesi', 'drink', 'Toronto', 9.99),
> ('Hasimal', 'toy', 'Toronto', 5.9),
> ('Fire War', 'game', 'Toronto', 70.0),
> ('Final Fantasy', 'game', 'Montreal', 79.99),
> ('Lego Friends 15005', 'toy', 'Montreal', 12.99),
> ('Nesion Milk', 'drink', 'Montreal', 8.9);
> 1. Query without ORDER BY after PARTITION BY col,  the result is correct.
> SELECT
> category, price,
> max(price) over(PARTITION BY category) as max_p,
> min(price) over(PARTITION BY category) as min_p,
> sum(price) over(PARTITION BY category) as sum_p,
> avg(price) over(PARTITION BY category) as avg_p,
> count(*) over(PARTITION BY category) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink           | 8.90      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 9.99      | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | drink           | 15.50    | 15.50    | 8.90 | 34.39 | 11.463333 | 3 |
> | game          | 79.99    | 79.99    | 70.00 | 149.99 | 74.995000 | 2 |
> | game          | 70.00    | 79.99 | 70.00 | 149.99 | 74.995000 | 2 |
> | toy              | 12.99    | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> | toy              | 5.90      | 12.99 | 5.90 | 18.89 | 9.445000 | 2 |
> 7 rows selected (0.442 seconds)
> 2 Query with ORDER BY after PARTITION BY col,  the result is NOT correct. Min 
> result is ok. Why other results are like that?
> SELECT
> category, price,
> max(price) over(PARTITION BY category ORDER BY price) as max_p,
> min(price) over(PARTITION BY category ORDER BY price) as min_p,
> sum(price) over(PARTITION BY category ORDER BY price) as sum_p,
> avg(price) over(PARTITION BY category ORDER BY price) as avg_p,
> count(*)   over(PARTITION BY category ORDER BY price) as count_w
> FROM
> product_catalog;
> || category    || price      || max_p  || min_p    || sum_p    || avg_p       
>     || count_w   ||
> | drink     | 8.90   | 8.90   | 8.90   | 8.90    | 8.900000   | 1        |
> | drink     | 9.99   | 9.99   | 8.90   | 18.89   | 9.445000   | 2        |
> | drink     | 15.50  | 15.50  | 8.90   | 34.39   | 11.463333  | 3        |
> | game      | 70.00  | 70.00  | 70.00  | 70.00   | 70.000000  | 1        |
> | game      | 79.99  | 79.99  | 70.00  | 149.99  | 74.995000  | 2        |
> | toy       | 5.90   | 5.90   | 5.90   | 5.90    | 5.900000   | 1        |
> | toy       | 12.99  | 12.99  | 5.90   | 18.89   | 9.445000   | 2        |
> 7 rows selected (0.436 seconds)
> Does it seem that we can only order by the columns after partition by clause?
> I do not think there are such limitation in standard SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-33116) Spark SQL window function with order by cause result incorrect

Reply via email to