Re: [PR] [SPARK-49558][SQL] Add SQL pipe syntax for LIMIT/OFFSET and ORDER/SORT/CLUSTER/DISTRIBUTE BY [spark]

via GitHub Fri, 25 Oct 2024 09:27:11 -0700


Angryrou commented on code in PR #48413:
URL: https://github.com/apache/spark/pull/48413#discussion_r1817004266



##########
sql/core/src/test/resources/sql-tests/inputs/pipe-operators.sql:
##########
@@ -571,6 +583,97 @@ table t
 table t
 |> union all table st;
 
+-- Sorting and repartitioning operators: positive tests.
+--------------------------------------------------------
+
+-- Order by.
+table t
+|> order by x;
+
+-- Order by with a table subquery.
+(select * from t)
+|> order by x;
+
+-- Order by with a VALUES list.
+values (0, 'abc') tab(x, y)
+|> order by x;
+
+-- Limit.
+table t
+|> order by x
+|> limit 1;
+
+-- Limit with offset.
+table t
+|> where x = 1
+|> select y
+|> limit 2 offset 1;
+
+-- Offset is allowed without limit.
+table t
+|> where x = 1
+|> select y
+|> offset 1;
+
+-- LIMIT ALL and OFFSET 0 are equivalent to no LIMIT or OFFSET clause, respectively.
+table t
+|> limit all offset 0;
+
+-- Distribute by.
+table t
+|> distribute by x;
+
+-- Cluster by.
+table t
+|> cluster by x;
+
+-- Sort and distribute by.
+table t
+|> sort by x distribute by x;
+
+-- It is possible to apply a final ORDER BY clause on the result of a query containing pipe
+-- operators.
+table t
+|> order by x desc
+order by y;
+
+-- Sorting and repartitioning operators: negative tests.
+--------------------------------------------------------
+
+-- Multiple order by clauses are not supported in the same pipe operator.
+-- We add an extra "ORDER BY y" clause at the end in this test to show that the "ORDER BY x + y"
+-- clause was consumed at the end of the final query, not as part of the pipe operator.
+table t
+|> order by x desc order by x + y
+order by y;
+
+-- The ORDER BY clause may only refer to column names from the previous input relation.
+table t
+|> select 1 + 2 as result
+|> order by x;
+
+-- The DISTRIBUTE BY clause may only refer to column names from the previous input relation.
+table t
+|> select 1 + 2 as result
+|> distribute by x;
+
+-- Combinations of multiple ordering and limit clauses are not supported.
+table t
+|> order by x limit 1;
+
+-- ORDER BY and SORT BY are not supported at the same time.
+table t
+|> order by x sort by x;
+
+-- The WINDOW clause is not supported yet.
+table windowTestData
+|> window w as (partition by cte order by val)

Review Comment:
   Hi Daniel @dtenedor, I noticed that this window clause in Spark differs from what’s described in the original paper and documentation. Could you share your thoughts on this?
   
   The [documentation](https://github.com/google/zetasql/blob/master/docs/pipe-syntax.md#window-pipe-operator) specifies that a window operator should always include a window function with an OVER clause. However, in Spark's syntax, the window operator only returns a window definition without requiring an OVER clause.
   
   I think it makes sense to keep the existing window syntax (as shown in this example) since the Extend clause will cover the window operator’s functionality as described in the paper. However, I’d like to confirm the expected behavior of the window clause in Spark SQL before proceeding with a PR.
   
   Thanks in advance!
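
To make the difference concrete, here is a rough sketch of the three forms under discussion. Statement (1) is copied from the test in this diff, where it appears as a negative test ("not supported yet"). Statements (2) and (3) follow the shape of the WINDOW and EXTEND operators in the linked pipe-syntax documentation, but they reuse the `table ...` starting form from this PR's tests instead of `FROM`. The `sum(val)` expression and the `running_total` alias are made up for illustration, and none of these statements is claimed to parse in Spark as of this PR.

```sql
-- (1) Form used in this PR's test: a bare named-window definition with no
--     OVER clause. The PR lists it as a negative test (not supported yet).
table windowTestData
|> window w as (partition by cte order by val);

-- (2) Documentation-style WINDOW pipe operator: a window function with an
--     OVER clause that adds a column to the input rows.
--     (Hypothetical expression and alias.)
table windowTestData
|> window sum(val) over (partition by cte order by val) as running_total;

-- (3) The same computation expressed through EXTEND, which is why EXTEND
--     could cover the documented WINDOW operator's functionality.
--     (Hypothetical expression and alias.)
table windowTestData
|> extend sum(val) over (partition by cte order by val) as running_total;
```

If (3) is available, form (2) adds nothing new, which leaves the question of whether (1) should stay as the meaning of the window clause in Spark SQL.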




Re: [PR] [SPARK-49558][SQL] Add SQL pipe syntax for LIMIT/OFFSET and ORDER/SORT/CLUSTER/DISTRIBUTE BY [spark]