beliefer opened a new pull request #26656: [SPARK-27986][SQL] Support ANSI SQL filter clause for aggregate expression.
URL: https://github.com/apache/spark/pull/26656

### What changes were proposed in this pull request?

The `FILTER` clause for aggregate expressions is part of the ANSI SQL standard:

```
<aggregate function> ::=
    COUNT <left paren> <asterisk> <right paren> [ <filter clause> ]
  | <general set function> [ <filter clause> ]
  | <binary set function> [ <filter clause> ]
  | <ordered set function> [ <filter clause> ]
  | <array aggregate function> [ <filter clause> ]
  | <row pattern count function> [ <filter clause> ]
```

Several mainstream databases already support this syntax.

**PostgreSQL:** https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES

For example:

```
SELECT year, count(*) FILTER (WHERE gdp_per_capita >= 40000)
FROM countries
GROUP BY year
```

```
SELECT year, code, gdp_per_capita,
       count(*) FILTER (WHERE gdp_per_capita >= 40000) OVER (PARTITION BY year)
FROM countries
```

**jOOQ:** https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/

**Notice:** This PR only supports the filter predicate without codegen; I will open another PR to add codegen support.

Below are some demonstrations of this PR in my production environment.
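As a quick illustration of the semantics (plain Python, not Spark code): `agg(x) FILTER (WHERE pred)` aggregates only the rows for which the predicate is true, much like rewriting the aggregate as `count(CASE WHEN pred THEN 1 END)`. A minimal sketch with made-up data (the rows below are illustrative, not from the PR):

```python
# Sketch of FILTER semantics: count(*) FILTER (WHERE gdp_per_capita >= 40000)
# grouped by year, i.e. count only the rows where the predicate holds.
from collections import defaultdict

# Illustrative rows (year, gdp_per_capita) -- hypothetical data.
rows = [
    {"year": 2014, "gdp_per_capita": 52000},
    {"year": 2014, "gdp_per_capita": 18000},
    {"year": 2015, "gdp_per_capita": 41000},
]

counts = defaultdict(int)
for r in rows:
    if r["gdp_per_capita"] >= 40000:  # the FILTER predicate
        counts[r["year"]] += 1

print(dict(counts))  # {2014: 1, 2015: 1}
```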
```
spark-sql> desc gja_test_partition;
key	string	NULL
value	string	NULL
other	string	NULL
col2	int	NULL
# Partition Information
# col_name	data_type	comment
col2	int	NULL
Time taken: 0.79 s
```

```
spark-sql> select * from gja_test_partition;
a	A	ao	1
b	B	bo	1
c	C	co	1
d	D	do	1
e	E	eo	2
g	G	go	2
h	H	ho	2
j	J	jo	2
f	F	fo	3
k	K	ko	3
l	L	lo	4
i	I	io	4
Time taken: 1.75 s
```

```
spark-sql> select count(key), sum(col2) from gja_test_partition;
12	26
Time taken: 1.848 s
```

```
spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition;
8
Time taken: 2.926 s
```

```
spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition;
14
Time taken: 2.087 s
```

```
spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition;
8	14
Time taken: 2.847 s
```

```
spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition;
12	8	26	14
Time taken: 1.787 s
```

```
spark-sql> desc student;
id	int	NULL
name	string	NULL
sex	string	NULL
class_id	int	NULL
Time taken: 0.206 s
```

```
spark-sql> select * from student;
1	张三	man	1
2	李四	man	1
3	王五	man	2
4	赵六	man	2
5	钱小花	woman	1
6	赵九红	woman	2
7	郭丽丽	woman	2
Time taken: 0.786 s
```

```
spark-sql> select class_id, count(id), sum(id) from student group by class_id;
1	3	8
2	4	20
Time taken: 18.783 s
```

```
spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id;
1	2	5
2	2	13
Time taken: 3.887 s
```

### Why are the changes needed?

Add a new SQL feature.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs and new UTs.
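As a sanity check, the grouped `FILTER` results from the `student` demo above can be reproduced with a few lines of plain Python (independent of Spark), aggregating over the same seven rows:

```python
# Cross-check of:
#   select class_id, count(id) filter (where sex = 'man'),
#          sum(id) filter (where sex = 'woman')
#   from student group by class_id;
# using the same rows shown in the demo (names omitted; they do not
# affect the aggregates).
students = [
    (1, "man", 1), (2, "man", 1), (3, "man", 2), (4, "man", 2),
    (5, "woman", 1), (6, "woman", 2), (7, "woman", 2),
]  # (id, sex, class_id)

result = {}
for sid, sex, cid in students:
    cnt_man, sum_woman = result.get(cid, (0, 0))
    if sex == "man":            # count(id) FILTER (WHERE sex = 'man')
        cnt_man += 1
    if sex == "woman":          # sum(id) FILTER (WHERE sex = 'woman')
        sum_woman += sid
    result[cid] = (cnt_man, sum_woman)

print(result)  # {1: (2, 5), 2: (2, 13)} -- matches the spark-sql output
```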
