[spark] branch master updated: [SPARK-27974][SQL] Support ANSI Aggregate Function: array_agg

wenchen Fri, 17 Dec 2021 03:58:01 -0800

This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 026158a  [SPARK-27974][SQL] Support ANSI Aggregate Function: array_agg
026158a is described below

commit 026158a554432cee8736f3595f83d86c5362f446
Author: Jiaan Geng <[email protected]>
AuthorDate: Fri Dec 17 19:56:19 2021 +0800

    [SPARK-27974][SQL] Support ANSI Aggregate Function: array_agg
    
    ### What changes were proposed in this pull request?
    This PR used to support ANSI aggregate Function: `array_agg`.
    
    The mainstream database supports `array_agg` show below:
    **Presto**
    https://prestodb.io/docs/current/functions/aggregate.html
    **Teradata**
    https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/Xffi_w_BZeSoaXg6Fd76Qw
    **Snowflake**
    https://docs.snowflake.com/en/sql-reference/functions/array_agg.html
    **Oracle**
    
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/JSON_ARRAYAGG.html#GUID-6D56077D-78DE-4CC0-9498-225DDC42E054
    **H2**
    http://www.h2database.com/html/functions-aggregate.html#array_agg
    **Postgresql**
    https://www.postgresql.org/docs/8.4/functions-aggregate.html
    **Maria**
    https://mariadb.com/kb/en/json_arrayagg/
    **Intersystems**
    
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RSQL_jsonarrayagg
    **MySQL**
    https://www.mysqltutorial.org/mysql-aggregate-functions.aspx
    **Druid**
    https://druid.apache.org/docs/latest/querying/sql.html#aggregation-functions
    
    ### Why are the changes needed?
    `array_agg` is very useful.
    
    ### Does this PR introduce _any_ user-facing change?
    'No'. New feature.
    
    ### How was this patch tested?
    New tests.
    
    Closes #34866 from beliefer/SPARK-27974.
    
    Authored-by: Jiaan Geng <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../sql-functions/sql-expression-schema.md         |  3 ++-
 .../test/resources/sql-tests/inputs/group-by.sql   | 14 +++++++++++
 .../inputs/postgreSQL/aggregates_part3.sql         |  2 +-
 .../sql-tests/inputs/postgreSQL/window_part4.sql   |  2 +-
 .../resources/sql-tests/results/group-by.sql.out   | 29 +++++++++++++++++++++-
 6 files changed, 47 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index 502d32b..13e755c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -446,6 +446,7 @@ object FunctionRegistry {
     expression[VariancePop]("var_pop"),
     expression[VarianceSamp]("var_samp"),
     expression[CollectList]("collect_list"),
+    expression[CollectList]("array_agg", true),
     expression[CollectSet]("collect_set"),
     expression[CountMinSketchAgg]("count_min_sketch"),
     expression[BoolAnd]("every", true),
diff --git a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md 
b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
index 94d93f1..07e1d00 100644
--- a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
+++ b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md
@@ -1,6 +1,6 @@
 <!-- Automatically generated by ExpressionsSchemaSuite -->
 ## Summary
-  - Number of queries: 374
+  - Number of queries: 375
   - Number of expressions that missing example: 12
   - Expressions missing examples: 
bigint,binary,boolean,date,decimal,double,float,int,smallint,string,timestamp,tinyint
 ## Schema of Built-in Functions
@@ -341,6 +341,7 @@
 | org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr | any | SELECT 
any(col) FROM VALUES (true), (false), (false) AS tab(col) | 
struct<any(col):boolean> |
 | org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr | bool_or | 
SELECT bool_or(col) FROM VALUES (true), (false), (false) AS tab(col) | 
struct<bool_or(col):boolean> |
 | org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr | some | SELECT 
some(col) FROM VALUES (true), (false), (false) AS tab(col) | 
struct<some(col):boolean> |
+| org.apache.spark.sql.catalyst.expressions.aggregate.CollectList | array_agg 
| SELECT array_agg(col) FROM VALUES (1), (2), (1) AS tab(col) | 
struct<collect_list(col):array<int>> |
 | org.apache.spark.sql.catalyst.expressions.aggregate.CollectList | 
collect_list | SELECT collect_list(col) FROM VALUES (1), (2), (1) AS tab(col) | 
struct<collect_list(col):array<int>> |
 | org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet | collect_set 
| SELECT collect_set(col) FROM VALUES (1), (2), (1) AS tab(col) | 
struct<collect_set(col):array<int>> |
 | org.apache.spark.sql.catalyst.expressions.aggregate.Corr | corr | SELECT 
corr(c1, c2) FROM VALUES (3, 2), (3, 3), (6, 4) as tab(c1, c2) | 
struct<corr(c1, c2):double> |
diff --git a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql 
b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
index 0a50d2f..1c29177 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/group-by.sql
@@ -213,3 +213,17 @@ SELECT regr_count(y, x) FROM testRegression;
 SELECT regr_count(y, x) FROM testRegression WHERE x IS NOT NULL;
 SELECT k, count(*), regr_count(y, x) FROM testRegression GROUP BY k;
 SELECT k, count(*) FILTER (WHERE x IS NOT NULL), regr_count(y, x) FROM 
testRegression GROUP BY k;
+
+-- SPARK-27974: Support ANSI Aggregate Function: array_agg
+SELECT
+  collect_list(col),
+  array_agg(col)
+FROM VALUES
+  (1), (2), (1) AS tab(col);
+SELECT
+  a,
+  collect_list(b),
+  array_agg(b)
+FROM VALUES
+  (1,4),(2,3),(1,4),(2,4) AS v(a,b)
+GROUP BY a;
\ No newline at end of file
diff --git 
a/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/aggregates_part3.sql 
b/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/aggregates_part3.sql
index 657ea59..41fd4de 100644
--- 
a/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/aggregates_part3.sql
+++ 
b/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/aggregates_part3.sql
@@ -73,7 +73,7 @@ select max(min(unique1)) from tenk1;
 -- drop table t2;
 -- drop table t3;
 
--- [SPARK-27974] Add built-in Aggregate Function: array_agg
+-- [SPARK-28664] ORDER BY in aggregate function
 --
 -- Test combinations of DISTINCT and/or ORDER BY
 --
diff --git 
a/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/window_part4.sql 
b/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/window_part4.sql
index 64ba8e3..dbb309a63 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/window_part4.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/window_part4.sql
@@ -388,7 +388,7 @@ FROM (VALUES(1,1),(2,2),(3,(cast('nan' as 
int))),(4,3),(5,4)) t(a,b);
 -- Tests for problems with failure to walk or mutate expressions
 -- within window frame clauses.
 
--- [SPARK-27974] Add built-in Aggregate Function: array_agg
+-- [SPARK-37612] Support window frame ORDER BY i ROWS BETWEEN (('foo' < 
'foobar')::integer) PRECEDING AND CURRENT ROW
 -- test walker (fails with collation error if expressions are not walked)
 -- SELECT array_agg(i) OVER w
 --   FROM range(1,6) i
diff --git a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out 
b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
index ff4fb48..73708f9 100644
--- a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 71
+-- Number of queries: 73
 
 
 -- !query
@@ -735,3 +735,30 @@ struct<k:int,count(1) FILTER (WHERE (x IS NOT 
NULL)):bigint,regr_count(y, x):big
 -- !query output
 1      0       0
 2      3       3
+
+
+-- !query
+SELECT
+  collect_list(col),
+  array_agg(col)
+FROM VALUES
+  (1), (2), (1) AS tab(col)
+-- !query schema
+struct<collect_list(col):array<int>,collect_list(col):array<int>>
+-- !query output
+[1,2,1]        [1,2,1]
+
+
+-- !query
+SELECT
+  a,
+  collect_list(b),
+  array_agg(b)
+FROM VALUES
+  (1,4),(2,3),(1,4),(2,4) AS v(a,b)
+GROUP BY a
+-- !query schema
+struct<a:int,collect_list(b):array<int>,collect_list(b):array<int>>
+-- !query output
+1      [4,4]   [4,4]
+2      [3,4]   [3,4]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[spark] branch master updated: [SPARK-27974][SQL] Support ANSI Aggregate Function: array_agg

Reply via email to