gengliangwang commented on code in PR #48852:
URL: https://github.com/apache/spark/pull/48852#discussion_r1842932775
##########
docs/sql-pipe-syntax.md:
##########
@@ -0,0 +1,162 @@
+---
+layout: global
+title: SQL Pipe Syntax
+displayTitle: SQL Pipe Syntax
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Syntax
+
+#### Overview
+
+Apache Spark supports SQL pipe syntax which allows composing queries from combinations of operators.
+
+* Any query can have zero or more pipe operators as a suffix, delineated by the pipe character `|>`.
+* Each pipe operator starts with one or more SQL keywords followed by its own grammar as described
+  in the table below.
+* Most of these operators reuse existing grammar for standard SQL clauses.
+* Operators can apply in any order, any number of times.
+
+`FROM <tableName>` is now a supported standalone query which behaves the same as
+`TABLE <tableName>`. This provides a convenient starting place to begin a chained pipe SQL query,
+although it is possible to add one or more pipe operators to the end of any valid Spark SQL query
+with the same consistent behavior as written here.
+
+Please refer to the table at the end of this document for a full list of all supported operators
+and their semantics.
+
+#### Example
+
+For example, this is query 13 from the TPC-H benchmark:
+
+```sql
+SELECT c_count, COUNT(*) AS custdist
+FROM
+  (SELECT c_custkey, COUNT(o_orderkey) c_count
+   FROM customer
+   LEFT OUTER JOIN orders ON c_custkey = o_custkey
+     AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey
+  ) AS c_orders
+GROUP BY c_count
+ORDER BY custdist DESC, c_count DESC;
+```
+
+To write the same logic using SQL pipe operators, we express it like this:
+
+```sql
+FROM customer
+|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
+   AND o_comment NOT LIKE '%unusual%packages%'
+|> AGGREGATE COUNT(o_orderkey) c_count
+   GROUP BY c_custkey
+|> AGGREGATE COUNT(*) AS custdist
+   GROUP BY c_count
+|> ORDER BY custdist DESC, c_count DESC;
+```
+
+#### Projection
+
+The following ways to evaluate expressions within projections are supported.
+
+* `SELECT` produces a new table by evaluating the provided expressions.<br>
+  It is possible to use `DISTINCT` and `*` as needed.<br>
+  This works like the outermost `SELECT` in a table subquery in regular Spark SQL.
+* `EXTEND` adds new columns to the input table by evaluating the provided expressions.<br>
+  This also preserves table aliases.<br>
+  This works like `SELECT *, new_column` in regular Spark SQL.
+* `DROP` removes columns from the input table.<br>
+  This is similar to `SELECT * EXCEPT (column)` in regular Spark SQL.
+* `SET` replaces column values from the input table.<br>
+  This is similar to `SELECT * REPLACE (expression AS column)` in regular Spark SQL.
+* `AS` forwards the input table and introduces a new alias for each row.
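To make the projection operators above concrete, here is a small illustrative sketch that chains them over the TPC-H `customer` table used earlier in this hunk. The computed column names are made up, and the exact `EXTEND`/`SET`/`AS` argument forms are inferred from the descriptions above rather than quoted from the PR:

```sql
FROM customer
|> EXTEND c_acctbal * 2 AS doubled_balance       -- add a new computed column
|> SET doubled_balance = round(doubled_balance)  -- replace the values of an existing column
|> DROP c_comment                                -- remove a column from the output
|> AS c                                          -- introduce a new alias for the rows
|> SELECT c.c_custkey, c.c_name, c.doubled_balance;
```

Each step consumes the previous step's output table, which is why the `SET` and `SELECT` steps can refer to `doubled_balance` without any lateral column references.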
+
+A major advantage of these projection features is that they support applying repeatedly to
+incrementally compute new expressions based on previous ones in a composable way. No lateral column
+references are needed here since each operator applies independently on its input table, regardless
+of the order in which they appear. Each of these computed columns are then visible to use any
+following operators.
+
+#### Aggregation
+
+Aggregation takes place differently using SQL pipe syntax as opposed to regular Spark SQL.
+
+* To perform full-table aggregation, use the `AGGREGATE` operator with a list of aggregate
+expressions to evaluate.<br>
+  This returns one single row in the output table.
+* To perform aggregation with grouping, use the `AGGREGATE` oeprator with a `GROUP BY` clause.<br>

Review Comment:
```suggestion
* To perform aggregation with grouping, use the `AGGREGATE` operator with a `GROUP BY` clause.<br>
```
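A companion sketch of the two `AGGREGATE` forms described in the hunk above, reusing the TPC-H `orders` table from the earlier example; the output column names are illustrative, not taken from the PR:

```sql
-- Full-table aggregation: the result is a single row.
FROM orders
|> AGGREGATE COUNT(*) AS num_orders, SUM(o_totalprice) AS total_price;

-- Aggregation with grouping: one result row per distinct o_orderpriority value.
FROM orders
|> AGGREGATE COUNT(*) AS num_orders, SUM(o_totalprice) AS total_price
   GROUP BY o_orderpriority;
```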
