gengliangwang commented on code in PR #48852:
URL: https://github.com/apache/spark/pull/48852#discussion_r1842932775
##########
docs/sql-pipe-syntax.md:
##########
@@ -0,0 +1,162 @@
+---
+layout: global
+title: SQL Pipe Syntax
+displayTitle: SQL Pipe Syntax
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Syntax
+
+#### Overview
+
+Apache Spark supports SQL pipe syntax which allows composing queries from combinations of operators.
+
+* Any query can have zero or more pipe operators as a suffix, delineated by the pipe character `|>`.
+* Each pipe operator starts with one or more SQL keywords followed by its own grammar as described
+  in the table below.
+* Most of these operators reuse existing grammar for standard SQL clauses.
+* Operators can apply in any order, any number of times.
+
+`FROM <tableName>` is now a supported standalone query which behaves the same as
+`TABLE <tableName>`. This provides a convenient starting place to begin a chained pipe SQL query,
+although it is possible to add one or more pipe operators to the end of any valid Spark SQL query
+with the same consistent behavior as written here.
+
+Please refer to the table at the end of this document for a full list of all supported operators
+and their semantics.
+
+#### Example
+
+For example, this is query 13 from the TPC-H benchmark:
+
+```sql
+SELECT c_count, COUNT(*) AS custdist
+FROM
+  (SELECT c_custkey, COUNT(o_orderkey) c_count
+   FROM customer
+   LEFT OUTER JOIN orders ON c_custkey = o_custkey
+     AND o_comment NOT LIKE '%unusual%packages%' GROUP BY c_custkey
+  ) AS c_orders
+GROUP BY c_count
+ORDER BY custdist DESC, c_count DESC;
+```
+
+To write the same logic using SQL pipe operators, we express it like this:
+
+```sql
+FROM customer
+|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
+   AND o_comment NOT LIKE '%unusual%packages%'
+|> AGGREGATE COUNT(o_orderkey) c_count
+   GROUP BY c_custkey
+|> AGGREGATE COUNT(*) AS custdist
+   GROUP BY c_count
+|> ORDER BY custdist DESC, c_count DESC;
+```
+
+#### Projection
+
+The following ways to evaluate expressions within projections are supported.
+
+* `SELECT` produces a new table by evaluating the provided expressions.<br>
+  It is possible to use `DISTINCT` and `*` as needed.<br>
+  This works like the outermost `SELECT` in a table subquery in regular Spark SQL.
+* `EXTEND` adds new columns to the input table by evaluating the provided expressions.<br>
+  This also preserves table aliases.<br>
+  This works like `SELECT *, new_column` in regular Spark SQL.
+* `DROP` removes columns from the input table.<br>
+  This is similar to `SELECT * EXCEPT (column)` in regular Spark SQL.
+* `SET` replaces column values from the input table.<br>
+  This is similar to `SELECT * REPLACE (expression AS column)` in regular Spark SQL.
+* `AS` forwards the input table and introduces a new alias for each row.
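To make the projection operators above concrete, here is a small illustrative sketch that chains them over the TPC-H `customer` table used earlier in this hunk. The computed column names are made up, and the exact `EXTEND`/`SET`/`AS` argument forms are inferred from the descriptions above rather than quoted from the PR:

```sql
FROM customer
|> EXTEND c_acctbal * 2 AS doubled_balance       -- add a new computed column
|> SET doubled_balance = round(doubled_balance)  -- replace the values of an existing column
|> DROP c_comment                                -- remove a column from the output
|> AS c                                          -- introduce a new alias for the rows
|> SELECT c.c_custkey, c.c_name, c.doubled_balance;
```

Each step consumes the previous step's output table, which is why the `SET` and `SELECT` steps can refer to `doubled_balance` without any lateral column references.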
+
+A major advantage of these projection features is that they support applying repeatedly to
+incrementally compute new expressions based on previous ones in a composable way. No lateral column
+references are needed here since each operator applies independently on its input table, regardless
+of the order in which they appear. Each of these computed columns are then visible to use any
+following operators.
+
+#### Aggregation
+
+Aggregation takes place differently using SQL pipe syntax as opposed to regular Spark SQL.
+
+* To perform full-table aggregation, use the `AGGREGATE` operator with a list of aggregate
+expressions to evaluate.<br>
+  This returns one single row in the output table.
+* To perform aggregation with grouping, use the `AGGREGATE` oeprator with a `GROUP BY` clause.<br>

Review Comment:
```suggestion
* To perform aggregation with grouping, use the `AGGREGATE` operator with a `GROUP BY` clause.<br>
```
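A companion sketch of the two `AGGREGATE` forms described in the hunk above, reusing the TPC-H `orders` table from the earlier example; the output column names are illustrative, not taken from the PR:

```sql
-- Full-table aggregation: the result is a single row.
FROM orders
|> AGGREGATE COUNT(*) AS num_orders, SUM(o_totalprice) AS total_price;

-- Aggregation with grouping: one result row per distinct o_orderpriority value.
FROM orders
|> AGGREGATE COUNT(*) AS num_orders, SUM(o_totalprice) AS total_price
   GROUP BY o_orderpriority;
```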
