Re: [PR] [SPARK-49562][SQL] Add SQL pipe syntax for aggregation [spark]

via GitHub Wed, 23 Oct 2024 17:12:31 -0700


dtenedor commented on code in PR #48529:
URL: https://github.com/apache/spark/pull/48529#discussion_r1813268872



##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -765,7 +765,7 @@ temporalClause
 aggregationClause
     : GROUP BY groupingExpressionsWithGroupingAnalytics+=groupByClause
         (COMMA groupingExpressionsWithGroupingAnalytics+=groupByClause)*
-    | GROUP BY groupingExpressions+=expression (COMMA 
groupingExpressions+=expression)* (
+    | GROUP BY groupingExpressions+=namedExpression (COMMA 
groupingExpressions+=namedExpression)* (

Review Comment:
   Sure, good question. So the answer is that with SQL pipe semantics, the 
AGGREGATE operator's output attribute list comprise all the grouping 
expressions first (in order), then all the aggregate functions next (in order). 
This is the specification from the original research paper [1].
   
   The idea is that the list of expressions after the AGGREGATE keyword do not 
need to include any of the grouping keys; they may only comprise aggregate 
function calls. Then following operators can refer to the grouping keys 
(possibly using aliases) and/or aggregate functions as needed, including just 
`|> SELECT`ing a subset.
   
   So this example:
   
   ```
   | AGGREGATE max(i) GROUP BY j + 1 as i
   ```
   
   would be equivalent to this using classic SQL:
   
   ```
   SELECT j + 1 as i, max(i) FROM t
   GROUP BY j + 1
   ```
   
   <img width="905" alt="image" 
src="https://github.com/user-attachments/assets/de8c0bf2-7c33-4ac7-a946-d4f93162fc71";>
   
   [1] 
https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/



##########
sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseParser.g4:
##########
@@ -765,7 +765,7 @@ temporalClause
 aggregationClause
     : GROUP BY groupingExpressionsWithGroupingAnalytics+=groupByClause
         (COMMA groupingExpressionsWithGroupingAnalytics+=groupByClause)*
-    | GROUP BY groupingExpressions+=expression (COMMA 
groupingExpressions+=expression)* (
+    | GROUP BY groupingExpressions+=namedExpression (COMMA 
groupingExpressions+=namedExpression)* (

Review Comment:
   Sure, good question. So the answer is that with SQL pipe semantics, the 
AGGREGATE operator's output attribute list comprises all the grouping 
expressions first (in order), then all the aggregate functions next (in order). 
This is the specification from the original research paper [1].
   
   The idea is that the list of expressions after the AGGREGATE keyword do not 
need to include any of the grouping keys; they may only comprise aggregate 
function calls. Then following operators can refer to the grouping keys 
(possibly using aliases) and/or aggregate functions as needed, including just 
`|> SELECT`ing a subset.
   
   So this example:
   
   ```
   | AGGREGATE max(i) GROUP BY j + 1 as i
   ```
   
   would be equivalent to this using classic SQL:
   
   ```
   SELECT j + 1 as i, max(i) FROM t
   GROUP BY j + 1
   ```
   
   <img width="905" alt="image" 
src="https://github.com/user-attachments/assets/de8c0bf2-7c33-4ac7-a946-d4f93162fc71";>
   
   [1] 
https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-49562][SQL] Add SQL pipe syntax for aggregation [spark]

Reply via email to